546470 - Rebuild dm-webtools02

Ryan Flint [:rflint] (ping via IRC for reviews)

Reporter

Description

•

15 years ago

$ ping tinderbox.mozilla.org
PING dm-webtools02.mozilla.org (63.245.208.148) 56(84) bytes of data.
^C
--- dm-webtools02.mozilla.org ping statistics ---
21 packets transmitted, 0 received, 100% packet loss, time 20089ms

Phong Tran [:phong]

Comment 1

•

15 years ago

it has a failed drive. i am looking into it.

Assignee: server-ops → phong

Flags: colo-trip+

Dave Miller [:justdave]

Comment 2

•

15 years ago

And it seems the RAID was striped, not mirrored. The box is toast. Fortunately the mail split was active, so tinderbox-stage has up-to-date data on it. I've changed DNS to point at tinderbox-stage's instance for now. The production box is likely going to have to be rebuilt from scratch.

Phong Tran [:phong]

Comment 3

•

15 years ago

rebuilt from scratch.

Assignee: phong → justdave

Dave Miller [:justdave]

Comment 5

•

15 years ago

chassis is DOA. I suspect the array controller. Was painfully slow watching puppet load the basics onto the machine, I rebooted when it was done and it didn't come back. I went into the console to find it sitting at the PXE prompt again. Another reboot provided these on the POST screen:

1783-Slot 0 Drive Array Controller Failure [Init failure (cmd=A5h, err=20h)]
1783-Slot 0 Drive Array Controller Failure! [Command failure (cmd=B1h, err=00h)]

I bet the original drives are fine and the array controller is shot on that box. Can we try putting the original drives in a different box and see if they work?

Dave Miller [:justdave]

Comment 6

•

15 years ago

Found a message on HP's forums with someone reporting a similar error message; they were instructed to re-seat the array controller card. They didn't report back whether that solved the problem or not (last message in the thread is dated a week ago).

Nick Thomas [:nthomas] (UTC+12)

Comment 7

•

15 years ago

FTR, the mail processing problems with tinderbox.m.o were resolved a few hours ago by justdave. Apparently bonsai is still down and so CVS trees should all be closed, but it's not letting me do that with the sheriff password. Updating the summary.

Severity: blocker → major

Summary: Tinderbox isn't responding → Array controller on dm-webtools02 died

Dave Miller [:justdave]

Comment 8

•

15 years ago

Bonsai's been up for a few hours actually... are you having problems with it still?

Nick Thomas [:nthomas] (UTC+12)

Comment 9

•

15 years ago

No, just confused by the lack of updates here.

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 10

•

15 years ago

(In reply to comment #8)
> Bonsai's been up for a few hours actually... are you having problems with it
> still?

Bonsai's blame|log|diff|graph functions all think the world stopped somewhere in about October 2008, though.

Dave Miller [:justdave]

Comment 11

•

15 years ago

aravind started ignoring me when I asked for a backup restore for bonsai several hours ago... I just started a cvs history rebuild, it'll probably take 5 or 6 hours to run, but that should get it all straightened out eventually.

Dave Miller [:justdave]

Comment 12

•

15 years ago

And that comment got his attention. ;) Backup was restored last night, which got us the changelog data up through Feb 4. Also discovered a broken cron job updating the local copy of the cvs repository that bonsai uses to generate the diffs, and that's been fixed, so bonsai should be working now.

matthew zeier [:mrz]

Comment 13

•

15 years ago

Need to work with HP on replacement.

Severity: major → enhancement

Whiteboard: [Need HP case]

bhearsum@mozilla.com (:bhearsum)

Comment 14

•

15 years ago

Could this be the reason we've been seeing Tinderbox stall for periods of time? E.g., 20 minutes between "mail sent" and "build shows up on tinderbox".

matthew zeier [:mrz]

Comment 15

•

15 years ago

Maybe, maybe not. We moved the services on 02 to a different machine.

Dave Miller [:justdave]

Comment 16

•

15 years ago

Ticket filed with HP.

Whiteboard: [Need HP case] → [HP:4611762532]

Phong Tran [:phong]

Comment 17

•

15 years ago

Just have them send the part to MPT and I can replace it myself.

Dave Miller [:justdave]

Comment 18

•

15 years ago

HP needs to know the model of array controller that's in it. It appears that it's completely powered off (ilo and all) right now, so I can't check remotely. They said for that model machine it should be either a 6i or a 6402, but they need to know which.

Dave Miller [:justdave]

Comment 19

•

15 years ago

Part shipped, I think. Not quite sure where it shipped to, and it's under one of these two case numbers. Probably find out when I get email confirmations in the next couple hours. Got a support rep trying to verify in person with someone in shipping whether they can still snag it, but they shipped to the wrong address on the first attempt.

Whiteboard: [HP:4611762532] → [HP:4611762532][HP:4611847502]

Dave Miller [:justdave]

Comment 20

•

15 years ago

So they obviously *still* can't get it straight. It appears the new array controller got shipped to Castro St now instead of MPT. This is still better than shipping it to me like they tried to do the first time. I'm not spending another 45 minutes on the phone to fix it, someone will have to cart it to MPT from the office. :)

Phong Tran [:phong]

Comment 21

•

15 years ago

I'll grab it from the office and install it.

Assignee: justdave → phong

Phong Tran [:phong]

Comment 22

•

15 years ago

system board. dhcp was updated with new mac address for nic and ilo.

Phong Tran [:phong]

Comment 23

•

15 years ago

unfortunately, we couldn't recover data from the old drive.

Assignee: phong → justdave

Dave Miller [:justdave]

Comment 24

•

15 years ago

Box has been kickstarted. Turns out we have a failed drive for real (probably why you couldn't recover data). Updated the existing ticket with the failed drive info, since HP hadn't closed it yet.

Whiteboard: [HP:4611762532][HP:4611847502] → [HP:4611762532]

Dave Miller [:justdave]

Comment 25

•

15 years ago

New drive has shipped. Directly to MPT this time. :)

Phong Tran [:phong]

Comment 26

•

15 years ago

drive replaced.

Dave Miller [:justdave]

Comment 27

•

15 years ago

So dm-webtools02 is now a blank slate. dm-webtools04 (where tinderbox and bonsai are running now) used to be the staging box for tinderbox and bonsai. It got promoted when dm-webtools02 died. Should we make dm-webtools02 the new staging box? Or move production back to it? dm-webtools04 is also the production box for MXR, so it probably makes sense to get production bonsai and tinderbox off to avoid competing for CPU...

Nick Thomas [:nthomas] (UTC+12)

Comment 28

•

15 years ago

Would be great to have tinderbox-stage back, if only to test things like bug 545825 before they hit production.

matthew zeier [:mrz]

Updated

•

15 years ago

Assignee: justdave → jdow

Summary: Array controller on dm-webtools02 died → Rebuild dm-webtools02

Whiteboard: [HP:4611762532]

Justin Dow [:jabba]

Assignee

Comment 29

•

15 years ago

I'd like to schedule switching tinderbox and bonsai production back over to dm-webtools02 during Thursday's downtime window. Does that work for everyone?

Flags: needs-downtime+

matthew zeier [:mrz]

Updated

•

15 years ago

Whiteboard: 05/06/2010 @ 7pm

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

15 years ago

Blocks: 563531

Nick Thomas [:nthomas] (UTC+12)

Comment 30

•

15 years ago

Is dm-webtools02 ready to go? Could we CNAME tinderbox-stage.m.o to it to make sure all is well before we cut it over?

Justin Dow [:jabba]

Assignee

Comment 31

•

15 years ago

No, it requires moving an iscsi mount, which is why we need the downtime tonight.

Justin Dow [:jabba]

Assignee

Comment 32

•

15 years ago

tinderbox.mozilla.org has been moved back to dm-webtools02 and tinderbox-stage.mozilla.org is set up on dm-webtools04 now.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → mozilla.org Graveyard