Bug 546470 - Rebuild dm-webtools02
05/06/2010 @ 7pm
Product: Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Hardware: All All
Importance: -- enhancement
Target Milestone: ---
Assigned To: Justin Dow [:jabba]
QA Contact: matthew zeier [:mrz]
Duplicates: 546505
Depends on:
Blocks: 563531
Reported: 2010-02-16 11:05 PST by Ryan Flint [:rflint] (ping via IRC for reviews)
Modified: 2015-03-12 08:17 PDT
jdow: needs‑downtime+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description Ryan Flint [:rflint] (ping via IRC for reviews) 2010-02-16 11:05:28 PST
$ ping
PING ( 56(84) bytes of data.
--- ping statistics ---
21 packets transmitted, 0 received, 100% packet loss, time 20089ms
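For anyone wanting to check this kind of outage automatically, here is a minimal sketch in Python that reads the summary line of Linux (iputils) `ping` output like the one in the description. The function names `parse_ping_summary` and `host_is_down` are illustrative assumptions, not part of any existing monitoring tool.

```python
import re

# Matches the iputils summary line, e.g.
# "21 packets transmitted, 0 received, 100% packet loss, time 20089ms"
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received, "
    r"(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent), or None if no summary line is found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match["sent"]), int(match["received"]), float(match["loss"])


def host_is_down(output):
    """Treat the host as down only when a summary exists and nothing came back."""
    summary = parse_ping_summary(output)
    return summary is not None and summary[1] == 0


sample = "21 packets transmitted, 0 received, 100% packet loss, time 20089ms"
print(parse_ping_summary(sample), host_is_down(sample))
```

Output with no summary line is deliberately not reported as "down", since that usually means the ping command itself failed rather than the host.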
Comment 1 Phong Tran [:phong] 2010-02-16 11:33:14 PST
it has a failed drive.  i am looking into it.
Comment 2 Dave Miller [:justdave] 2010-02-16 12:02:56 PST
And it seems the RAID was striped, not mirrored.  The box is toast.

Fortunately the mail split was active, so tinderbox-stage has up-to-date data
on it.  I've changed DNS to point at tinderbox-stage's instance for now.

The production box is likely going to have to be rebuilt from scratch.
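
To spell out why a striped array is fatal here while a mirror would not have been, a rough sketch in Python follows. It assumes a plain stripe (RAID 0 style, data split across all drives with no redundancy) and a plain mirror (RAID 1 style, every drive holds a full copy); the function name `array_survives` and the layout labels are made up for illustration and do not describe the actual controller configuration on this box.

```python
def array_survives(layout, total_drives, failed_drives):
    """Return True if the array's data is still readable after the given failures.

    layout: "stripe" (no redundancy) or "mirror" (every drive is a full copy).
    """
    if not 0 <= failed_drives <= total_drives:
        raise ValueError("failed_drives must be between 0 and total_drives")
    if layout == "stripe":
        # Every drive holds a unique slice of the data, so any loss is total.
        return failed_drives == 0
    if layout == "mirror":
        # One surviving copy is enough.
        return failed_drives < total_drives
    raise ValueError("unknown layout: " + repr(layout))


for layout in ("stripe", "mirror"):
    print(layout, array_survives(layout, total_drives=2, failed_drives=1))
```

With two drives and one failure, the stripe case reports the data as lost and the mirror case reports it as intact, which is the difference between a drive swap and a rebuild from scratch.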
Comment 3 Phong Tran [:phong] 2010-02-16 13:31:56 PST
rebuilt from scratch.
Comment 4 chizu 2010-02-16 14:07:01 PST
*** Bug 546505 has been marked as a duplicate of this bug. ***
Comment 5 Dave Miller [:justdave] 2010-02-16 14:37:07 PST
chassis is DOA.  I suspect the array controller.  It was painfully slow watching puppet load the basics onto the machine; I rebooted when it was done and it didn't come back.  I went into the console to find it sitting at the PXE prompt again.  Another reboot provided these on the POST screen:

1783-Slot 0 Drive Array Controller Failure                                      
     [Init failure (cmd=A5h, err=20h)]                                          
1783-Slot 0 Drive Array Controller Failure!                                     
     [Command failure (cmd=B1h, err=00h)]                                       

I bet the original drives are fine and the array controller is shot on that box.

Can we try putting the original drives in a different box and see if they work?
Comment 6 Dave Miller [:justdave] 2010-02-16 14:39:21 PST
Found a message on HP's forums from someone reporting a similar error message; they were instructed to re-seat the array controller card.  They didn't report back whether that solved the problem (last message in the thread is dated a week ago).
Comment 7 Nick Thomas [:nthomas] 2010-02-16 19:56:34 PST
FTR, the mail processing problems with tinderbox.m.o were resolved a few hours ago by justdave. Apparently bonsai is still down and so CVS trees should all be closed, but it's not letting me do that with the sheriff password.

Updating the summary.
Comment 8 Dave Miller [:justdave] 2010-02-16 20:28:55 PST
Bonsai's been up for a few hours actually... are you having problems with it still?
Comment 9 Nick Thomas [:nthomas] 2010-02-16 20:33:14 PST
No, just confused by the lack of updates here.
Comment 10 Smokey Ardisson (offline for a while; not following bugs - do not email) 2010-02-16 20:47:45 PST
(In reply to comment #8)
> Bonsai's been up for a few hours actually... are you having problems with it
> still?

Bonsai's blame|log|diff|graph functions all think the world stopped somewhere in about October 2008, though.
Comment 11 Dave Miller [:justdave] 2010-02-16 20:56:04 PST
aravind started ignoring me when I asked for a backup restore for bonsai several hours ago...   I just started a cvs history rebuild; it'll probably take 5 or 6 hours to run, but that should get it all straightened out eventually.
Comment 12 Dave Miller [:justdave] 2010-02-17 13:27:29 PST
And that comment got his attention. ;)  Backup was restored last night, which got us the changelog data up through Feb 4.  Also discovered a broken cron job updating the local copy of the cvs repository that bonsai uses to generate the diffs, and that's been fixed, so bonsai should be working now.
Comment 13 matthew zeier [:mrz] 2010-02-24 10:12:07 PST
Need to work with HP on replacement.
Comment 14 Ben Hearsum (:bhearsum) 2010-02-24 10:13:27 PST
Could this be the reason we've been seeing Tinderbox stall for periods of time? E.g., 20 minutes between "mail sent" and "build shows up on tinderbox".
Comment 15 matthew zeier [:mrz] 2010-02-24 11:20:13 PST
Maybe, maybe not.  We moved the services on 02 to a different machine.
Comment 16 Dave Miller [:justdave] 2010-03-13 01:03:37 PST
Ticket filed with HP.
Comment 17 Phong Tran [:phong] 2010-03-13 13:16:46 PST
Just have them send the part to MPT and I can replace it myself.
Comment 18 Dave Miller [:justdave] 2010-03-15 07:01:44 PDT
HP needs to know the model of array controller that's in it.  It appears that it's completely powered off (ilo and all) right now, so I can't check remotely.

They said for that model machine it should be either a 6i or a 6402, but they need to know which.
Comment 19 Dave Miller [:justdave] 2010-03-15 18:40:37 PDT
Part shipped, I think.  Not quite sure where it shipped to, and it's under one of these two case numbers.  Probably find out when I get email confirmations in the next couple hours.  Got a support rep trying to verify in person with someone in shipping whether they can still snag it, but they shipped to the wrong address on the first attempt.
Comment 20 Dave Miller [:justdave] 2010-03-17 05:27:49 PDT
So they obviously *still* can't get it straight.  It appears the new array controller got shipped to Castro St now instead of MPT.  This is still better than shipping it to me like they tried to do the first time.  I'm not spending another 45 minutes on the phone to fix it, someone will have to cart it to MPT from the office. :)
Comment 21 Phong Tran [:phong] 2010-03-17 09:54:32 PDT
I'll grab it from the office and install it.
Comment 22 Phong Tran [:phong] 2010-03-18 11:13:57 PDT
system board.  dhcp was updated with new mac address for nic and ilo.
Comment 23 Phong Tran [:phong] 2010-03-18 12:46:03 PDT
unfortunately, we couldn't recover data from the old drive.
Comment 24 Dave Miller [:justdave] 2010-03-19 01:21:25 PDT
Box has been kickstarted.  Turns out we have a failed drive for real (probably why you couldn't recover data).  Updated the existing ticket with the failed drive info, since HP hadn't closed it yet.
Comment 25 Dave Miller [:justdave] 2010-03-19 14:44:22 PDT
New drive has shipped.  Directly to MPT this time. :)
Comment 26 Phong Tran [:phong] 2010-03-23 10:09:01 PDT
drive replaced.
Comment 27 Dave Miller [:justdave] 2010-03-25 22:45:26 PDT
So dm-webtools02 is now a blank slate.  dm-webtools04 (where tinderbox and bonsai are running now) used to be the staging box for tinderbox and bonsai.  It got promoted when dm-webtools02 died.  Should we make dm-webtools02 the new staging box?  Or move production back to it?  dm-webtools04 is also the production box for MXR, so it probably makes sense to get production bonsai and tinderbox off to avoid competing for CPU...
Comment 28 Nick Thomas [:nthomas] 2010-03-25 23:58:13 PDT
Would be great to have tinderbox-stage back, if only to test things like bug 545825 before they hit production.
Comment 29 Justin Dow [:jabba] 2010-05-04 14:59:09 PDT
I'd like to schedule switching tinderbox and bonsai production back over to dm-webtools02 during Thursday's downtime window. Does that work for everyone?
Comment 30 Nick Thomas [:nthomas] 2010-05-06 15:08:22 PDT
Is dm-webtools02 ready to go? Could we CNAME tinderbox-stage.m.o to it to make sure all is well before we cut it over?
Comment 31 Justin Dow [:jabba] 2010-05-06 15:56:19 PDT
No, it requires moving an iscsi mount, which is why we need the downtime tonight.
Comment 32 Justin Dow [:jabba] 2010-05-06 21:51:12 PDT
has been moved back to dm-webtools02 and is set up on dm-webtools04 now.
