Closed Bug 423816 Opened 12 years ago Closed 12 years ago

tinderbox cb-xserve03 unreachable after reboot

Categories

(Release Engineering :: General, defect)

PowerPC
macOS
defect
Not set

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: ause, Unassigned)

Details

ssh login from the jumphost fails with "no route to host", which, as far as i understand, points to a network problem rather than a problem with the machine itself
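For anyone triaging a similar report: "no route to host" does not by itself rule out the machine. A router may be reporting the host as unreachable, or a host on the local network may simply not be answering ARP because it is powered off or hung. A minimal Python sketch for telling the common failure modes apart could look like this (the function names `classify_connect_error` and `probe` are made up for illustration, and the hint strings are rough guesses, not a diagnosis):

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a socket error to a rough hint about where the fault may lie."""
    # socket.timeout is an OSError subclass, so check it before errno.
    if isinstance(exc, socket.timeout):
        return "timeout: packets are being dropped or the host is down"
    code = getattr(exc, "errno", None)
    if code in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return ("no route: a router reported the host unreachable, or the "
                "host did not answer on the local network (possibly down)")
    if code == errno.ECONNREFUSED:
        return "refused: host is up but nothing is listening on the port"
    return "other: {!r}".format(exc)


def probe(host, port=22, timeout=5.0):
    """Try a plain TCP connect to the ssh port and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok: TCP connection established"
    except OSError as exc:
        return classify_connect_error(exc)
```

Running `probe("cb-xserve03.mozilla.com")` from the jumphost would show which case applies; in this bug the box turned out to be hung at boot.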
Blocks: 420840
Phong - box just needs a reboot.  It's in 101.05-9.
Assignee: server-ops → phong.tran
Flags: colo-trip+
server won't boot up.  it is stuck at the apple screen.
Status: NEW → ASSIGNED
I was able to boot it to the cd after a very long wait.  I am leaving it here.
Phong, could you please help us to get the machine up and running again? Ause mentioned it's still *not* accessible from outside. We're just about to release the next calendar version and this whole outage hits us hard...
i still can not reach it and have no physical access to that machine. i even
have no clue where this box may be located physically. it's the tinderbox we
need to do our 0.8 release which was scheduled this week.
any suggestion on how to proceed?
Severity: normal → critical
Priority: -- → P1
<lowering the sev so as not to page people>

This is a hardware issue from the sounds of it, and it will have to be sent in to apple for repair (usually 5-7 days), assuming it is still under warranty.  Unfortunately this won't be a quick fix.
Severity: critical → major
Apologies if I'm telling folks things they already know but before sending it in, the following might help:

1. Attempt to boot with a keyboard connected (DIRECTLY... not via a KVM switch), holding down Shift while booting.  If this works, it will boot in Safe Mode, and you can run Disk First Aid from there.

2. If that doesn't work, attempt to boot with Apple-V held down.  This is verbose mode and should allow you to see kernel messages, etc. Perhaps you can see where it's hanging up.

3. Once booted from the OSXS install CD, run Disk Utility, and First Aid on it to see if there are errors.  If so, attempt to fix them.  If Disk First Aid can't fix it, try Alsoft DiskWarrior.

4. Run the Apple Hardware Diagnostics CD.  It usually can find what's broke.

5. Put the drives in another Xserve and see if it boots there.

6. Put each drive of the mirror separately in another Xserve and see if it boots there.
One more thing:
If the mirror is still intact, but just the OS is borked, don't wipe both mirror drives when installing, so we can attempt to copy over all the tinderbox bits, rather than having to do so from scratch.
> 1. Attempt to boot with a keyboard connected (DIRECTLY... not via a KVM
> switch), holding down Shift while booting.  If this works, it will boot in Safe
> Mode, and you can run Disk First Aid from there.
> 
> 2. If that doesn't work, attempt to boot with Apple-V held down.  This is
> verbose mode and should allow you to see kernel messages, etc. Perhaps you can
> see where it's hanging up.
> 
> 3. Once booted from the OSXS install CD, run Disk Utility, and First Aid on it
> to see if there are errors.  If so, attempt to fix them.  If Disk First Aid
> can't fix it, try Alsoft DiskWarrior.

All done - no help, no disk errors. 

> 4. Run the Apple Hardware Diagnostics CD.  It usually can find what's broke.

We could do this, but it still has to go to Apple.

> 5. Put the drives in another Xserve and see if it boots there.
> 
> 6. Put each drive of the mirror separately in another Xserve and see if it
> boots there.

All others are in use - no spares - do you have a spare to try (or someone in the community)?

Also, your next comment is a good one.  If we have to go down that route, will do.

Justin, has the machine been sent to Apple yet?  We really need a quick turnaround on this, since it is blocking the next calendar release (which was supposed to ship *this* week).

Thanks for your help.
This machine is defined as having tier3 support per (http://wiki.mozilla.org/Build:Farm) - we have a lot else going on with betas/outages, so may be sometime next week before we get this back.

Copying John Oduinn so he's in the loop.
(In reply to comment #11)
> This machine is defined as having tier3 support per
> (http://wiki.mozilla.org/Build:Farm) - we have a lot else going on with
> betas/outages, so may be sometime next week before we get this back.
> 
> Copying John Oduinn so he's in the loop.
> 
We know everyone is busy with the upcoming beta.  Sam S. offered to loan us a tinderbox from the Camino project in the meantime; if that pans out, this bug won't block the calendar release.  We'll report back here if we can use the tinderbox he has available.
(In reply to comment #11)
> This machine is defined as having tier3 support per
> (http://wiki.mozilla.org/Build:Farm) - we have a lot else going on with
> betas/outages, so may be sometime next week before we get this back.
> 
> Copying John Oduinn so he's in the loop.
aiui, cb-xserve03 is a PPC-based xserve. 

We do have a PPC-based xserve (xserve04) that we just mothballed. Our plan was to let xserve04 sit idle for a little bit in case anyone needed it turned back on urgently, then do a backup and reimage so we would have a spare PPC-xserve in case one of ours died. If the loaner from Sam doesn't work out, let me know and we can speed up the backup/reimage/switchnetworks/re-key and let you use xserve04 while Apple do repairs on cb-xserve03... Keep in mind, with all the other releases going on this week, it's still going to be at least mid-late next week though.

(We also have an intel-based xserve with 10.5, but not sure that's useful to you).


(...and yuk, lousy timing on the hardware blowout. Between this and the colo fun, it seems like the photocopier gremlins have gone high tech this week.)
Ok, so looks like this may be a corrupt OS image - apple wants a fresh install of the OS - can we do this?  We can't really keep one copy of the drive as you won't have a raid'd device to install the OS onto.  Can you re-create your dev env from a fresh install?


If so, someone can do the install tonight.
to be honest, i simply don't know. i'm not aware of anything magic and it has been done once, so it should be possible again.
on the other hand i haven't set up this machine. lilmatt?
Severity: major → normal
Priority: P1 → --
coop, what's the best way forward here ? Can we use our PPC image and re-scrub ?
(In reply to comment #16)
> coop, what's the best way forward here ? Can we use our PPC image and re-scrub
> ?
> 

Nick: yes, that's what we've done in the past.
As a reminder, the original src ppc image is still available on one of the external FW drives.
The new IP is 10.2.73.252
Ignore that last comment.  I was updating the wrong bug.
The first restore took over 3 hours and failed with 2 files not copied.  I recreated the RAID and am trying the restore again.  I will leave it running overnight and come back to check on it tomorrow.
CB-XSERVE03 is up and running again.  What IP address should I assign it?
I have it attached to one of the consoles in 101.05.
I think you left - any idea what console channel that was?  
IP: 63.245.210.20 / 255.255.255.224
GW: 63.245.210.1

Nameservers:
64.127.100.12
64.235.225.10  
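A quick sanity check of the settings above, as a small Python sketch using the standard `ipaddress` module. It only confirms that the gateway sits inside the subnet implied by the address and netmask; the helper name `describe_interface` is made up for illustration.

```python
import ipaddress


def describe_interface(address, netmask, gateway):
    """Derive the subnet from address/netmask and check the gateway is on it."""
    iface = ipaddress.ip_interface("{}/{}".format(address, netmask))
    gw = ipaddress.ip_address(gateway)
    return {
        "network": str(iface.network),
        "gateway_on_link": gw in iface.network,
        # Subtract the network and broadcast addresses.
        "usable_hosts": iface.network.num_addresses - 2,
    }


info = describe_interface("63.245.210.20", "255.255.255.224", "63.245.210.1")
print(info)
```

For the values in this comment it reports the network as 63.245.210.0/27 with the gateway on-link, so the address, mask and gateway are at least consistent with each other.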
cb-xserve03 is up and running again.
i can connect to that machine now but i cannot log in as calbld. how do i initially access this machine?
reading the comments in this bug, i'm not sure if this machine is now "ready to build up a tinderbox again" or "ready to be given to apple".
anyone got a hint for me?
Email me offline username/password - this is a clone of a build image with their logins.  I'll add that account for you.

(or send me your keys and I'll dump you into root's key file)
(In reply to comment #29)
> Email me offline username/password - this is a clone of a build image with
> their logins.  I'll add that account for you.
> 
> (or send me your keys and I'll dump you into root's key file)
> 
done


resolving
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I don't see cb-xserve03 back on the tinderbox pages, neither trunk nor mozilla1.8 branch. Is there another bug to make it work again?
(In reply to comment #32)
> I don't see cb-xserve03 back on the tinderbox pages, neither trunk nor
> mozilla1.8 branch. Is there another bug to make it work again?
> 

i can login as root now but neither does the user calbld exist nor does vnc work for me (black on black).
also there are lots of complaints in system.log and asl.log that reverse dns and hostname do not match, which may or may not have something to do with my vnc problem. i found this in windows.log:
Apr 03 04:19:46  [113] kCGErrorCannotComplete: CGXPostNotification2 : Time out waiting for reply from "" for notification type 102 (CID 0x9fff, PID 18851)

i think that's worth reopening...




Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer blocks Bug 420840 due to the tinderbox borrowed from the Camino project.
No longer blocks: 420840
VNC works for me on OSX to that box - verified that the other day with you.  


DNS matches to me:
cb-xserve03:/var/log root# host 63.245.210.20
20.210.245.63.in-addr.arpa domain name pointer cb-xserve03.mozilla.com.
cb-xserve03:/var/log root# host  cb-xserve03
cb-xserve03.mozilla.com has address 63.245.210.20
cb-xserve03:/var/log root# host cb-xserve03.mozilla.com
cb-xserve03.mozilla.com has address 63.245.210.20
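The same forward/reverse comparison can be scripted. Below is a minimal Python sketch of a forward-confirmed reverse DNS check: look up the PTR name for the address, resolve that name again, and see whether the original address comes back. The function name `forward_confirmed` and the injectable `reverse`/`forward` parameters are assumptions added so the logic can be exercised without a live resolver; by default it uses the standard `socket` lookups.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return (hostname, ok); ok is True if the PTR name resolves back to ip."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return (None, False)
    return (hostname, ip in addresses)
```

Note that this checks what the resolver says, not what the machine believes its own hostname to be; log complaints about a mismatch can also come from the locally configured hostname differing from the DNS name.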

IIRC, you're not on OSX - can you try a different VNC client?
Blocks: 420840
No longer blocks: 420840
meanwhile i tried four different clients on three different machines (linux and windows). the result is always the same: the window is black on black when limiting the protocol version to 3.x, and the connection is lost after some garbage with 4.x.
which account did you use for tunneling? could you try as root, which is currently the only option for me?
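Since the clients behave differently depending on protocol version, it may help to see which version the server itself announces. An RFB (VNC) server opens the conversation with a 12-byte greeting of the form `RFB xxx.yyy` followed by a newline. The Python sketch below reads and parses that greeting; the function names are made up for illustration, and the host and port would be whatever end of the ssh tunnel is in use (5900 is only the conventional default).

```python
import re
import socket

# ProtocolVersion greeting: "RFB ", three digits, ".", three digits, newline.
_BANNER = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_rfb_banner(banner):
    """Return (major, minor) from a 12-byte RFB greeting, or raise ValueError."""
    match = _BANNER.fullmatch(banner)
    if match is None:
        raise ValueError("not an RFB greeting: {!r}".format(banner))
    return int(match.group(1)), int(match.group(2))


def read_rfb_version(host, port=5900, timeout=5.0):
    """Connect, read the server's 12-byte greeting, and parse it."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        while len(data) < 12:
            chunk = sock.recv(12 - len(data))
            if not chunk:  # server closed the connection early
                break
            data += chunk
    return parse_rfb_banner(data)
```

A server that announces an unusual minor version may be speaking a vendor dialect that generic clients handle poorly, which would be one possible explanation for the black window.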

regarding dns, the logs are full of these messages. no idea what's wrong.

ok, found a mac and tried a mac vnc client (tunneled vnc over vnc...) - only to find myself on a display belonging to cltbld and requiring its password for whatever i need.
The standard setup for a community box is
* set root password to the one known by Build and community admins
* set up a calbld user with the defined password for that account
* set VNC password to the calbld password
* delete the cltbld user

I'll try to catch mrz to do step one, and will then do the rest.
Assignee: phong.tran → nrthomas
Status: REOPENED → NEW
root password's been changed.  Nothing left for IT to do.
Assignee: nrthomas → nobody
Component: Server Operations → Release Engineering
Flags: colo-trip+
QA Contact: justin → release
All done, and also installed calbld keys and fixed up the CVS/Root files in /builds/tinderbox/mozilla.
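For reference, fixing up CVS/Root files on a box cloned from another build image usually means rewriting every `CVS/Root` under the checkout so it names the right repository and account. A hedged Python sketch of that kind of fix-up follows; the function name `rewrite_cvs_roots` is made up, and the new root string is left as a parameter because the actual value used here is not recorded in this bug.

```python
import os


def rewrite_cvs_roots(tree, new_root):
    """Rewrite every CVS/Root file under tree to new_root; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(tree):
        # Each checked-out directory has a CVS/ subdirectory with a Root file.
        if os.path.basename(dirpath) == "CVS" and "Root" in filenames:
            path = os.path.join(dirpath, "Root")
            with open(path, "r", encoding="utf-8") as handle:
                current = handle.read().strip()
            if current != new_root:
                with open(path, "w", encoding="utf-8") as handle:
                    handle.write(new_root + "\n")
                changed.append(path)
    return changed
```

It would be called on the checkout directory (here /builds/tinderbox/mozilla) with the intended root string, and it leaves files that already match untouched.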
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: mozilla.org → Release Engineering