Closed Bug 546470 Opened 10 years ago Closed 10 years ago
$ ping tinderbox.mozilla.org
PING dm-webtools02.mozilla.org (188.8.131.52) 56(84) bytes of data.
^C
--- dm-webtools02.mozilla.org ping statistics ---
21 packets transmitted, 0 received, 100% packet loss, time 20089ms
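For anyone wanting to turn a check like the ping above into something scriptable, here is a minimal sketch that parses the summary line and flags a host with total packet loss. The function names (parse_ping_summary, host_is_down) are made up for illustration, and the pattern assumes the Linux iputils-style summary wording shown in the paste; other ping implementations word this line differently.

```python
import re

# Matches the iputils-style summary line, e.g.
# "21 packets transmitted, 0 received, 100% packet loss, time 20089ms"
# An optional "+N errors, " field can appear before the loss percentage.
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, "
    r"(?P<received>\d+) received, "
    r"(?:\+\d+ errors, )?"
    r"(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return sent/received/loss figures from ping output, or None if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return {
        "sent": int(match.group("sent")),
        "received": int(match.group("received")),
        "loss_pct": float(match.group("loss")),
    }


def host_is_down(output):
    """True only when packets were sent and none came back."""
    summary = parse_ping_summary(output)
    return summary is not None and summary["sent"] > 0 and summary["received"] == 0


if __name__ == "__main__":
    pasted = (
        "--- dm-webtools02.mozilla.org ping statistics ---\n"
        "21 packets transmitted, 0 received, 100% packet loss, time 20089ms\n"
    )
    print(parse_ping_summary(pasted))
    print(host_is_down(pasted))
```

Returning None for unrecognised output (rather than treating it as "down") keeps a wording change in ping from being mistaken for an outage.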
it has a failed drive. i am looking into it.
Assignee: server-ops → phong
And it seems the RAID was striped, not mirrored. The box is toast. Fortunately the mail split was active, so tinderbox-stage has up-to-date data on it. I've changed DNS to point at tinderbox-stage's instance for now. The production box is likely going to have to be rebuilt from scratch.
Assignee: phong → justdave
chassis is DOA. I suspect the array controller. It was painfully slow watching puppet load the basics onto the machine; I rebooted when it was done and it didn't come back. I went into the console to find it sitting at the PXE prompt again. Another reboot provided these on the POST screen:
1783-Slot 0 Drive Array Controller Failure [Init failure (cmd=A5h, err=20h)]
1783-Slot 0 Drive Array Controller Failure! [Command failure (cmd=B1h, err=00h)]
I bet the original drives are fine and the array controller is shot on that box. Can we try putting the original drives in a different box and see if they work?
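If these POST messages need to go into the HP case in structured form, a rough sketch like the following pulls the fields out of the two lines quoted above. The function name parse_post_errors and the dictionary keys are illustrative only, the pattern is fitted to just these two messages, and it makes no claim about what the cmd/err codes mean to the controller firmware.

```python
import re

# Fitted to the two messages quoted in this bug, e.g.
# "1783-Slot 0 Drive Array Controller Failure [Init failure (cmd=A5h, err=20h)]"
# The trailing "h" on cmd/err is read as a hexadecimal suffix.
POST_ERROR_RE = re.compile(
    r"(?P<code>\d{4})-Slot (?P<slot>\d+) Drive Array Controller Failure!? "
    r"\[(?P<kind>\w+) failure "
    r"\(cmd=(?P<cmd>[0-9A-Fa-f]+)h, err=(?P<err>[0-9A-Fa-f]+)h\)\]"
)


def parse_post_errors(text):
    """Return one dict per POST error message found in the text."""
    errors = []
    for match in POST_ERROR_RE.finditer(text):
        errors.append({
            "code": match.group("code"),
            "slot": int(match.group("slot")),
            "kind": match.group("kind"),
            "cmd": int(match.group("cmd"), 16),
            "err": int(match.group("err"), 16),
        })
    return errors


if __name__ == "__main__":
    screen = (
        "1783-Slot 0 Drive Array Controller Failure "
        "[Init failure (cmd=A5h, err=20h)]\n"
        "1783-Slot 0 Drive Array Controller Failure! "
        "[Command failure (cmd=B1h, err=00h)]\n"
    )
    for entry in parse_post_errors(screen):
        print(entry)
```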
Found a message on HP's forums from someone reporting a similar error message; they were instructed to re-seat the array controller card. They didn't report back whether that solved the problem or not (last message in the thread is dated a week ago).
FTR, the mail processing problems with tinderbox.m.o were resolved a few hours ago by justdave. Apparently bonsai is still down and so CVS trees should all be closed, but it's not letting me do that with the sheriff password. Updating the summary.
Severity: blocker → major
Summary: Tinderbox isn't responding → Array controller on dm-webtools02 died
Bonsai's been up for a few hours actually... are you having problems with it still?
No, just confused by the lack of updates here.
(In reply to comment #8)
> Bonsai's been up for a few hours actually... are you having problems with it
> still?
Bonsai's blame|log|diff|graph functions all think the world stopped somewhere in about October 2008, though.
aravind started ignoring me when I asked for a backup restore for bonsai several hours ago... I just started a cvs history rebuild, it'll probably take 5 or 6 hours to run, but that should get it all straightened out eventually.
And that comment got his attention. ;) Backup was restored last night, which got us the changelog data up through Feb 4. Also discovered a broken cron job updating the local copy of the cvs repository that bonsai uses to generate the diffs, and that's been fixed, so bonsai should be working now.
Need to work with HP on replacement.
Severity: major → enhancement
Whiteboard: [Need HP case]
Could this be the reason we've been seeing Tinderbox stall for periods of time? E.g., 20 minutes between "mail sent" and "build shows up on tinderbox".
Maybe, maybe not. We moved the services on 02 to a different machine.
Ticket filed with HP.
Whiteboard: [Need HP case] → [HP:4611762532]
Just have them send the part to MPT and I can replace it myself.
HP needs to know the model of array controller that's in it. It appears that it's completely powered off (ilo and all) right now, so I can't check remotely. They said for that model machine it should be either a 6i or a 6402, but they need to know which.
Part shipped, I think. Not quite sure where it shipped to, and it's under one of these two case numbers. I'll probably find out when I get email confirmations in the next couple of hours. Got a support rep trying to verify in person with someone in shipping whether they can still snag it, but they shipped to the wrong address on the first attempt.
Whiteboard: [HP:4611762532] → [HP:4611762532][HP:4611847502]
So they obviously *still* can't get it straight. It appears the new array controller got shipped to Castro St now instead of MPT. This is still better than shipping it to me like they tried to do the first time. I'm not spending another 45 minutes on the phone to fix it, someone will have to cart it to MPT from the office. :)
I'll grab it from the office and install it.
Assignee: justdave → phong
system board. dhcp was updated with new mac address for nic and ilo.
unfortunately, we couldn't recover data from the old drive.
Assignee: phong → justdave
Box has been kickstarted. Turns out we have a failed drive for real (probably why you couldn't recover data). Updated the existing ticket with the failed drive info, since HP hadn't closed it yet.
Whiteboard: [HP:4611762532][HP:4611847502] → [HP:4611762532]
New drive has shipped. Directly to MPT this time. :)
So dm-webtools02 is now a blank slate. dm-webtools04 (where tinderbox and bonsai are running now) used to be the staging box for tinderbox and bonsai. It got promoted when dm-webtools02 died. Should we make dm-webtools02 the new staging box? Or move production back to it? dm-webtools04 is also the production box for MXR, so it probably makes sense to get production bonsai and tinderbox off to avoid competing for CPU...
Would be great to have tinderbox-stage back, if only to test things like bug 545825 before they hit production.
Assignee: justdave → jdow
Summary: Array controller on dm-webtools02 died → Rebuild dm-webtools02
I'd like to schedule switching tinderbox and bonsai production back over to dm-webtools02 during Thursday's downtime window. Does that work for everyone?
Is dm-webtools02 ready to go? Could we CNAME tinderbox-stage.m.o to it to make sure all is well before we cut it over?
No, it requires moving an iscsi mount, which is why we need the downtime tonight.
tinderbox.mozilla.org has been moved back to dm-webtools02 and tinderbox-stage.mozilla.org is set up on dm-webtools04 now.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard