Closed
Bug 546470
Opened 14 years ago
Closed 14 years ago
Rebuild dm-webtools02
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rflint, Assigned: jabba)
References
Details
(Whiteboard: 05/06/2010 @ 7pm)
$ ping tinderbox.mozilla.org
PING dm-webtools02.mozilla.org (63.245.208.148) 56(84) bytes of data.
^C
--- dm-webtools02.mozilla.org ping statistics ---
21 packets transmitted, 0 received, 100% packet loss, time 20089ms
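For anyone wanting to script the same check, here is a minimal sketch that reads the summary line ping prints on exit and flags a host as unreachable when nothing came back. The helper names (`parse_ping_summary`, `host_is_unreachable`) are hypothetical and not part of any tooling mentioned in this bug; the regex assumes the Linux iputils summary format shown in the report.

```python
import re

# Hypothetical helpers for illustration only; not tooling from this bug.
# The pattern assumes the Linux iputils summary line, including the optional
# "+N errors," field that appears when ICMP errors were received.
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received,"
    r"(?: \+\d+ errors,)? (\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output: str) -> dict:
    """Extract sent/received counts and the loss percentage from ping output."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received, loss = match.groups()
    return {"sent": int(sent), "received": int(received), "loss_pct": float(loss)}


def host_is_unreachable(output: str) -> bool:
    """True when packets were sent but none were answered."""
    stats = parse_ping_summary(output)
    return stats["sent"] > 0 and stats["received"] == 0


# Summary lines copied from the report above.
report = (
    "--- dm-webtools02.mozilla.org ping statistics ---\n"
    "21 packets transmitted, 0 received, 100% packet loss, time 20089ms\n"
)
print(host_is_unreachable(report))
```

Note that total packet loss only shows the host is not answering ICMP; it does not by itself say whether the cause is the network, a firewall, or (as here) dead hardware.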
Comment 1•14 years ago
it has a failed drive. i am looking into it.
Assignee: server-ops → phong
Flags: colo-trip+
Comment 2•14 years ago
And it seems the RAID was striped, not mirrored. The box is toast. Fortunately the mail split was active, so tinderbox-stage has up-to-date data on it. I've changed DNS to point at tinderbox-stage's instance for now. The production box is likely going to have to be rebuilt from scratch.
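As a rough illustration of why one failed drive takes out a striped array but not a mirrored one, the sketch below models the two layouts. The function name and the two-drive example are assumptions made for illustration; the bug does not say how many drives the array had.

```python
from typing import Sequence


def array_survives(layout: str, drive_ok: Sequence[bool]) -> bool:
    """Return True if the array's data is still readable.

    Hypothetical model for illustration: drive_ok holds one flag per member
    drive, True meaning that drive is healthy.
    """
    if layout == "striped":
        # RAID 0: data is split across all drives with no redundancy,
        # so losing any single drive loses the whole array.
        return all(drive_ok)
    if layout == "mirrored":
        # RAID 1: every drive holds a full copy, so one healthy drive is enough.
        return any(drive_ok)
    raise ValueError(f"unknown layout: {layout}")


# Assumed two-drive array with one failed drive.
drives = [True, False]
print(array_survives("striped", drives))
print(array_survives("mirrored", drives))
```

This ignores controller faults, which fail the array regardless of layout.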
Comment 5•14 years ago
chassis is DOA. I suspect the array controller. Was painfully slow watching puppet load the basics onto the machine, I rebooted when it was done and it didn't come back. I went into the console to find it sitting at the PXE prompt again. Another reboot provided these on the POST screen:
1783-Slot 0 Drive Array Controller Failure [Init failure (cmd=A5h, err=20h)]
1783-Slot 0 Drive Array Controller Failure! [Command failure (cmd=B1h, err=00h)]
I bet the original drives are fine and the array controller is shot on that box. Can we try putting the original drives in a different box and see if they work?
Comment 6•14 years ago
Found a message on HP's forums from someone reporting a similar error message; they were instructed to re-seat the array controller card. They didn't report back whether that solved the problem or not (last message in the thread is dated a week ago).
Comment 7•14 years ago
FTR, the mail processing problems with tinderbox.m.o were resolved a few hours ago by justdave. Apparently bonsai is still down and so CVS trees should all be closed, but it's not letting me do that with the sheriff password. Updating the summary.
Severity: blocker → major
Summary: Tinderbox isn't responding → Array controller on dm-webtools02 died
Comment 8•14 years ago
Bonsai's been up for a few hours actually... are you having problems with it still?
Comment 9•14 years ago
No, just confused by the lack of updates here.
(In reply to comment #8)
> Bonsai's been up for a few hours actually... are you having problems with it
> still?
Bonsai's blame|log|diff|graph functions all think the world stopped somewhere in about October 2008, though.
Comment 11•14 years ago
aravind started ignoring me when I asked for a backup restore for bonsai several hours ago... I just started a cvs history rebuild, it'll probably take 5 or 6 hours to run, but that should get it all straightened out eventually.
Comment 12•14 years ago
And that comment got his attention. ;) Backup was restored last night which got us the changelog data up through Feb 4. Also discovered a broken cron job updating the local copy of the cvs repository that bonsai uses to generate the diffs, and that's been fixed, so bonsai should be working now.
Comment 13•14 years ago
Need to work with HP on replacement.
Severity: major → enhancement
Whiteboard: [Need HP case]
Comment 14•14 years ago
Could this be the reason we've been seeing Tinderbox stall for periods of time? E.g., 20 minutes between "mail sent" and "build shows up on tinderbox".
Comment 15•14 years ago
Maybe, maybe not. We moved the services on 02 to a different machine.
Comment 17•14 years ago
Just have them send the part to MPT and I can replace it myself.
Comment 18•14 years ago
HP needs to know the model of array controller that's in it. It appears that it's completely powered off (ilo and all) right now, so I can't check remotely. They said for that model machine it should be either a 6i or a 6402, but they need to know which.
Comment 19•14 years ago
Part shipped, I think. Not quite sure where it shipped to, and it's under one of these two case numbers. Probably find out when I get email confirmations in the next couple hours. Got a support rep trying to verify with someone in shipping in person whether they can still snag it, but they shipped to the wrong address on the first attempt.
Whiteboard: [HP:4611762532] → [HP:4611762532][HP:4611847502]
Comment 20•14 years ago
So they obviously *still* can't get it straight. It appears the new array controller got shipped to Castro St now instead of MPT. This is still better than shipping it to me like they tried to do the first time. I'm not spending another 45 minutes on the phone to fix it, someone will have to cart it to MPT from the office. :)
Comment 22•14 years ago
system board. dhcp was updated with new mac address for nic and ilo.
Comment 23•14 years ago
unfortunately, we couldn't recover data from the old drive.
Assignee: phong → justdave
Comment 24•14 years ago
Box has been kickstarted. Turns out we have a failed drive for real (probably why you couldn't recover data). Updated the existing ticket with the failed drive info, since HP hadn't closed it yet.
Whiteboard: [HP:4611762532][HP:4611847502] → [HP:4611762532]
Comment 25•14 years ago
New drive has shipped. Directly to MPT this time. :)
Comment 26•14 years ago
drive replaced.
Comment 27•14 years ago
So dm-webtools02 is now a blank slate. dm-webtools04 (where tinderbox and bonsai are running now) used to be the staging box for tinderbox and bonsai. It got promoted when dm-webtools02 died. Should we make dm-webtools02 the new staging box? Or move production back to it? dm-webtools04 is also the production box for MXR, so it probably makes sense to get production bonsai and tinderbox off to avoid competing for CPU...
Comment 28•14 years ago
Would be great to have tinderbox-stage back, if only to test things like bug 545825 before they hit production.
Updated•14 years ago
Assignee: justdave → jdow
Summary: Array controller on dm-webtools02 died → Rebuild dm-webtools02
Whiteboard: [HP:4611762532]
Assignee
Comment 29•14 years ago
I'd like to schedule switching tinderbox and bonsai production back over to dm-webtools02 during Thursday's downtime window. Does that work for everyone?
Flags: needs-downtime+
Updated•14 years ago
Whiteboard: 05/06/2010 @ 7pm
Comment 30•14 years ago
Is dm-webtools02 ready to go? Could we CNAME tinderbox-stage.m.o to it to make sure all is well before we cut it over?
Assignee
Comment 31•14 years ago
No, it requires moving an iscsi mount, which is why we need the downtime tonight.
Assignee
Comment 32•14 years ago
tinderbox.mozilla.org has been moved back to dm-webtools02 and tinderbox-stage.mozilla.org is set up on dm-webtools04 now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard