Closed
Bug 888674
(foopy87)
Opened 12 years ago
Closed 11 years ago
foopy87 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Unassigned)
References
Details
(Whiteboard: [buildduty][buildslaves][capacity])
Attachments
(1 file)
7.41 KB,
image/png
foopy87.p7.releng.scl1.mozilla.com didn't come back from the scl1 power outage (bug 888656)
<vinh> dustin: Looks like foopy-87 might be hosed.
Can you take down foopy85.p7.releng.scl1.mozilla.com so I can pop foopy-87 into the slot to test?
We opted to defer that for a few days because we can tolerate a missing foopy.
Comment 1•12 years ago
Current status: waiting for RMA.
Comment 2•12 years ago
:arr, in order to return capacity and to stop releng deploys for foopies from alerting that the scripts can't connect here, I wonder if we have a spare identical system that can be hot-swapped in (even with a different DNS entry). Then, when this RMA is done, we would just put the repaired machine back in place of whatever spare we end up using here.
Final setup of the replacement foopy shouldn't be hard.
Flags: needinfo?(arich)
Comment 3•12 years ago
I presume you aren't using mozpool to request devices yet, which would allow you to use pandas from any rack (we built this into the management software so that the loss of any one machine wouldn't impact capacity significantly)?
As far as I know, iX does not provide spare hardware while we're RMAing machines. I've followed up to ask DCOps in the RMA bug.
As a stopgap, (if you're hard coding things), you could switch to a different set of pandas and their respective foopy, since I know we have an abundance of panda capacity.
Flags: needinfo?(arich)
Comment 4•12 years ago
Yeah, overall we're not starving for capacity. My primary reason for asking was "is there a spare machine we can get in place to avoid needing to care about this bug anymore" ;-)
It would also get this off my radar when doing code deploys, since we get errors about being unable to update this foopy (because it exists in our devices.json allocation information).
There are ways around it if there is not a spare though.
Assignee | ||
Updated•12 years ago
Product: mozilla.org → Release Engineering
Comment 5•11 years ago
Host is down again, can we try and see what is wrong -- and specifically what iX did in the previous RMA:
https://bugzilla.mozilla.org/show_bug.cgi?id=889106#c12
Flags: needinfo?(arich)
Comment 6•11 years ago
Please coordinate with DCOps for triage/replacement.
Flags: needinfo?(arich)
Comment 7•11 years ago
There's a thing running that's called foopy87, but it doesn't appear to have any devices running successfully on it. Anyone know the status here?
Flags: needinfo?(bugspam.Callek)
Comment 8•11 years ago
Unsure off the top of my head. DCOps?
Flags: needinfo?(bugspam.Callek) → needinfo?(vhua)
Comment 9•11 years ago
This node was replaced by iX Systems from bug 889106. Right now I can ping and ssh to it. Did you want me to run a hard disk and memtest on it?
Flags: needinfo?(vhua)
Comment 10•11 years ago
(In reply to Vinh Hua [:vinh] from comment #9)
> This node was replaced by iX Systems from bug 889106. Right now I can ping
> and ssh to it. Did you want me to run a hard disk and memtest on it?
Yes, please. Then please reimage it for extra sanity.
Flags: needinfo?(vhua)
Comment 12•11 years ago
Memtest and hard disk diagnostics came back clean. I tried to reimage, but it keeps giving me an error message indicating not enough disk space.
Comment 13•11 years ago
Rebooting this machine always brings it back to this screen.
Comment 14•11 years ago
[vle@boris ~]$ /usr/sbin/fping foopy87.p7.releng.scl1.mozilla.com
foopy87.p7.releng.scl1.mozilla.com is alive
[vle@boris ~]$ ssh !$
ssh foopy87.p7.releng.scl1.mozilla.com
The authenticity of host 'foopy87.p7.releng.scl1.mozilla.com (10.12.134.23)' can't be established.
RSA key fingerprint is ee:4d:22:23:1d:b3:3d:78:bf:8d:7e:d1:03:0b:b5:6e.
Are you sure you want to continue connecting (yes/no)?
Reporter | ||
Comment 15•11 years ago
Nagios is alerting about this box since:
Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Doesn't seem to be related to bug 985129.
Comment 16•11 years ago
(In reply to Nick Thomas [:nthomas] from comment #15)
> Nagios is alerting about this box since:
> Thu 14:54:59 PDT [4857] foopy87.p7.releng.scl1.mozilla.com is DOWN :PING
> CRITICAL - Packet loss = 100%
>
> Doesn't seem to be related to bug 985129.
Certainly not from bug 985129. I suspect this is a bad machine that either has a boot CD in it (per :armen's c#13) or has other hardware problems.
Updated•11 years ago
Assignee: nobody → armenzg
Updated•11 years ago
Assignee: armenzg → nobody
Comment 17•11 years ago
Host is online, responding to pings, and SSH'able. Current uptime: 06:47:06 up 8 days, 16:20, 1 user, load average: 0.00, 0.00, 0.00
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 18•11 years ago
and back offline again :(
re-opening IT bug 992119
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
QA Contact: armenzg → bugspam.Callek
Comment 19•11 years ago
using devices.json as my guide:
[root@foopy87.p7.releng.scl1.mozilla.com builds]# for i in {0685..0697}; do mkdir panda-$i; chown cltbld.cltbld panda-$i; done
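For anyone repeating this on another foopy, the one-liner above can be sketched as a small re-runnable helper. This is only an illustration, not the actual releng tooling: the function name `make_panda_dirs`, its arguments, and the `/builds` path in the usage note are assumptions. The panda range still has to be read by hand from devices.json, as was done here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the releng tools): create panda-NNNN
# working directories for a numeric range and optionally set their owner.
# Arguments: base directory, first number, last number, optional owner.
# Pass the numbers as plain decimals (685, not 0685); zero padding to
# four digits is added here.
make_panda_dirs() {
    local base_dir=$1 first=$2 last=$3 owner=${4:-}
    local n dir
    for n in $(seq "$first" "$last"); do
        dir=$(printf '%s/panda-%04d' "$base_dir" "$n")
        # -p makes the call safe to repeat if some directories already exist.
        mkdir -p "$dir"
        if [ -n "$owner" ]; then
            # user:group is the portable form of the older user.group syntax.
            chown "$owner" "$dir"
        fi
    done
}
```

Assuming the build directory is `/builds`, the equivalent of the command above would be `make_panda_dirs /builds 685 697 cltbld:cltbld`, run as root.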
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard