Bug 546470 - Rebuild dm-webtools02
Status: RESOLVED FIXED
Whiteboard: 05/06/2010 @ 7pm
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All All
Importance: -- enhancement
Target Milestone: ---
Assigned To: Justin Dow [:jabba]
QA Contact: matthew zeier [:mrz]
Mentors:
Duplicates: 546505
Depends on:
Blocks: 563531
Reported: 2010-02-16 11:05 PST by Ryan Flint [:rflint] (ping via IRC for reviews)
Modified: 2015-03-12 08:17 PDT
jdow: needs‑downtime+
Description Ryan Flint [:rflint] (ping via IRC for reviews) 2010-02-16 11:05:28 PST
$ ping tinderbox.mozilla.org
PING dm-webtools02.mozilla.org (63.245.208.148) 56(84) bytes of data.
^C
--- dm-webtools02.mozilla.org ping statistics ---
21 packets transmitted, 0 received, 100% packet loss, time 20089ms
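
For anyone scripting a check like this, here is a minimal sketch of pulling the numbers out of the summary line that ping printed above. It assumes the Linux iputils output format shown in this report; `parse_ping_summary` and `host_is_unreachable` are hypothetical helper names, not part of any existing tool.

```python
import re

# Matches the summary line printed by ping in the report above, e.g.
# "21 packets transmitted, 0 received, 100% packet loss, time 20089ms"
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received,"
    r".*?(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) from ping output, or None."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match["sent"]), int(match["received"]), float(match["loss"])


def host_is_unreachable(output):
    """Hypothetical helper: True when probes were sent but none came back."""
    summary = parse_ping_summary(output)
    return summary is not None and summary[0] > 0 and summary[1] == 0


sample = "21 packets transmitted, 0 received, 100% packet loss, time 20089ms"
```

Calling `host_is_unreachable(sample)` on the line from this report flags the host as down, since 21 probes were sent and none were answered.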
Comment 1 Phong Tran [:phong] 2010-02-16 11:33:14 PST
it has a failed drive.  i am looking into it.
Comment 2 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-16 12:02:56 PST
And it seems the RAID was striped, not mirrored.  The box is toast.

Fortunately the mail split was active, so tinderbox-stage has up-to-date data
on it.  I've changed DNS to point at tinderbox-stage's instance for now.

The production box is likely going to have to be rebuilt from scratch.
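
To spell out why striped versus mirrored matters here, the following is a toy model only. It assumes a two-drive array (the actual drive count is not stated in this bug), and `array_survives` is a hypothetical name used for illustration.

```python
def array_survives(level, failed_drives, total_drives=2):
    """Toy model: does the array's data survive the given drive failures?

    level 0 (striping): each drive holds a unique slice of the data,
        so any single failure loses the whole array.
    level 1 (mirroring): each drive holds a full copy, so the data
        survives as long as at least one drive is still healthy.
    total_drives=2 is an assumption; the real drive count is not in the bug.
    """
    if level == 0:
        return failed_drives == 0
    if level == 1:
        return failed_drives < total_drives
    raise ValueError("this sketch only models RAID 0 and RAID 1")


print(array_survives(0, failed_drives=1))  # striped array, one dead drive
print(array_survives(1, failed_drives=1))  # mirrored array, one dead drive
```

With one failed drive the striped case reports the data as lost while the mirrored case reports it as intact, which is why a single drive failure was enough to take this box out.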
Comment 3 Phong Tran [:phong] 2010-02-16 13:31:56 PST
rebuilt from scratch.
Comment 4 chizu 2010-02-16 14:07:01 PST
*** Bug 546505 has been marked as a duplicate of this bug. ***
Comment 5 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-16 14:37:07 PST
Chassis is DOA.  I suspect the array controller.  It was painfully slow watching puppet load the basics onto the machine; I rebooted when it was done and it didn't come back.  I went into the console to find it sitting at the PXE prompt again.  Another reboot produced these on the POST screen:

1783-Slot 0 Drive Array Controller Failure
     [Init failure (cmd=A5h, err=20h)]

1783-Slot 0 Drive Array Controller Failure!
     [Command failure (cmd=B1h, err=00h)]

I bet the original drives are fine and the array controller is shot on that box.

Can we try putting the original drives in a different box and see if they work?
Comment 6 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-16 14:39:21 PST
Found a message on HP's forums from someone reporting a similar error message; they were instructed to re-seat the array controller card.  They didn't report back whether that solved the problem or not (last message in the thread is dated a week ago).
Comment 7 Nick Thomas [:nthomas] 2010-02-16 19:56:34 PST
FTR, the mail processing problems with tinderbox.m.o were resolved a few hours ago by justdave. Apparently bonsai is still down and so CVS trees should all be closed, but it's not letting me do that with the sheriff password.

Updating the summary.
Comment 8 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-16 20:28:55 PST
Bonsai's been up for a few hours actually... are you having problems with it still?
Comment 9 Nick Thomas [:nthomas] 2010-02-16 20:33:14 PST
No, just confused by the lack of updates here.
Comment 10 Smokey Ardisson (offline for a while; not following bugs - do not email) 2010-02-16 20:47:45 PST
(In reply to comment #8)
> Bonsai's been up for a few hours actually... are you having problems with it
> still?

Bonsai's blame|log|diff|graph functions all think the world stopped somewhere in about October 2008, though.
Comment 11 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-16 20:56:04 PST
aravind started ignoring me when I asked for a backup restore for bonsai several hours ago...   I just started a cvs history rebuild; it'll probably take 5 or 6 hours to run, but that should get it all straightened out eventually.
Comment 12 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-02-17 13:27:29 PST
And that comment got his attention. ;)  Backup was restored last night, which got us the changelog data up through Feb 4.  Also discovered a broken cron job updating the local copy of the cvs repository that bonsai uses to generate the diffs, and that's been fixed, so bonsai should be working now.
Comment 13 matthew zeier [:mrz] 2010-02-24 10:12:07 PST
Need to work with HP on replacement.
Comment 14 Ben Hearsum (:bhearsum) 2010-02-24 10:13:27 PST
Could this be the reason we've been seeing Tinderbox stall for periods of time? E.g., 20 minutes between "mail sent" and "build shows up on tinderbox".
Comment 15 matthew zeier [:mrz] 2010-02-24 11:20:13 PST
Maybe, maybe not.  We moved the services on 02 to a different machine.
Comment 16 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-13 01:03:37 PST
Ticket filed with HP.
Comment 17 Phong Tran [:phong] 2010-03-13 13:16:46 PST
Just have them send the part to MPT and I can replace it myself.
Comment 18 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-15 07:01:44 PDT
HP needs to know the model of array controller that's in it.  It appears that it's completely powered off (ilo and all) right now, so I can't check remotely.

They said for that model machine it should be either a 6i or a 6402, but they need to know which.
Comment 19 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-15 18:40:37 PDT
Part shipped, I think.  Not quite sure where it shipped to, and it's under one of these two case numbers.  Probably find out when I get email confirmations in the next couple hours.  Got a support rep trying to verify in person with someone in shipping whether they can still snag it, but they shipped to the wrong address on the first attempt.
Comment 20 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-17 05:27:49 PDT
So they obviously *still* can't get it straight.  It appears the new array controller got shipped to Castro St now instead of MPT.  This is still better than shipping it to me like they tried to do the first time.  I'm not spending another 45 minutes on the phone to fix it, someone will have to cart it to MPT from the office. :)
Comment 21 Phong Tran [:phong] 2010-03-17 09:54:32 PDT
I'll grab it from the office and install it.
Comment 22 Phong Tran [:phong] 2010-03-18 11:13:57 PDT
replaced the system board.  dhcp was updated with the new mac addresses for the nic and ilo.
Comment 23 Phong Tran [:phong] 2010-03-18 12:46:03 PDT
unfortunately, we couldn't recover data from the old drive.
Comment 24 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-19 01:21:25 PDT
Box has been kickstarted.  Turns out we have a failed drive for real (probably why you couldn't recover data).  Updated the existing ticket with the failed drive info, since HP hadn't closed it yet.
Comment 25 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-19 14:44:22 PDT
New drive has shipped.  Directly to MPT this time. :)
Comment 26 Phong Tran [:phong] 2010-03-23 10:09:01 PDT
drive replaced.
Comment 27 Dave Miller [:justdave] (justdave@bugzilla.org) 2010-03-25 22:45:26 PDT
So dm-webtools02 is now a blank slate.  dm-webtools04 (where tinderbox and bonsai are running now) used to be the staging box for tinderbox and bonsai.  It got promoted when dm-webtools02 died.  Should we make dm-webtools02 the new staging box?  Or move production back to it?  dm-webtools04 is also the production box for MXR, so it probably makes sense to get production bonsai and tinderbox off to avoid competing for CPU...
Comment 28 Nick Thomas [:nthomas] 2010-03-25 23:58:13 PDT
Would be great to have tinderbox-stage back, if only to test things like bug 545825 before they hit production.
Comment 29 Justin Dow [:jabba] 2010-05-04 14:59:09 PDT
I'd like to schedule switching tinderbox and bonsai production back over to dm-webtools02 during Thursday's downtime window. Does that work for everyone?
Comment 30 Nick Thomas [:nthomas] 2010-05-06 15:08:22 PDT
Is dm-webtools02 ready to go? Could we CNAME tinderbox-stage.m.o to it to make sure all is well before we cut it over?
Comment 31 Justin Dow [:jabba] 2010-05-06 15:56:19 PDT
No, it requires moving an iscsi mount, which is why we need the downtime tonight.
Comment 32 Justin Dow [:jabba] 2010-05-06 21:51:12 PDT
tinderbox.mozilla.org has been moved back to dm-webtools02 and tinderbox-stage.mozilla.org is set up on dm-webtools04 now.
