Closed
Bug 672973
Opened 13 years ago
Closed 13 years ago
iX hardware issues in scl1 post heatsink/fan/RAM modifications
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: zandr)
Details
(Whiteboard: last few tracked in bug 673972)
The following iX machines in scl1 still exhibit issues after the heatsink/fan/RAM modifications and should be returned to iX for diagnosis and repair:

linux-ix-slave01
linux-ix-slave06
linux-ix-slave33
linux64-ix-slave26
linux64-ix-slave37
w32-ix-slave41

w32-ix-slave23 is already at iX for hardware issues.

The w64-ix machines have not been tested, since we are waiting on the Windows builder VLAN configuration to be complete before imaging those machines.

The following w32 hosts have also not been tested, since it has not been possible to break the boot cycle to get into the BIOS and change the boot order so that network is first (these will likely require crash carting):

w32-ix-slave01
w32-ix-slave03
w32-ix-slave26
w32-ix-slave29

The included spreadsheet URL will be used to track status.
Assignee
Updated•13 years ago
colo-trip: --- → scl1
Assignee
Comment 1•13 years ago
(In reply to comment #0)
> linux-ix-slave01
> linux-ix-slave06
> linux-ix-slave33
> linux64-ix-slave26
> linux64-ix-slave37

These machines have been pulled, stacked on the cart, and iX notified to pick them up.

> w32-ix-slave41

I think this might be an entirely blank drive. If there isn't even a partition table, DeployStudio won't see it. I'll investigate this tomorrow.

> w32-ix-slave01
> w32-ix-slave03
> w32-ix-slave26
> w32-ix-slave29

These four machines have been set to netboot in BIOS, and I've kicked off reimaging them with the win32-ix-ref-110527 image.

Spreadsheet updated.
Comment 3•13 years ago
Please add linux64-ix-slave21 to the list of machines to go to iX - it is hung with a machine check exception (4 bank 5 and lots of 0s).
Reporter
Comment 4•13 years ago
Please add linux64-ix-slave36 to the list of machines to check for an initial partition table. It's also rebooting into DeployStudio without finding a disk.
Assignee
Comment 5•13 years ago
w32-ix-slave41: wrote partition table, started imaging from w32-ix-ref-110527
linux64-ix-slave36: wrote partition table, started imaging from linux64-ix-ref-110527
linux64-ix-slave21: pulled and delivered to iX

I haven't put these updates in the spreadsheet yet.
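The bug doesn't record how the partition tables were written. As a hedged illustration of the "blank drive" problem from comment 1, the sketch below writes the minimal on-disk structure an imaging tool probes for: a 512-byte MBR with all four partition entries zeroed but carrying the 0x55AA boot signature. The file name `disk.img` is a stand-in for the real device node (e.g. /dev/sda on the actual slave), not anything from the bug.

```python
# Sketch only: illustrates what "no partition table" means for a
# tool like DeployStudio. A minimal (empty) MBR is 512 bytes whose
# four 16-byte partition entries are zeroed and which ends in the
# boot signature 0x55 0xAA.
IMG = "disk.img"  # hypothetical stand-in for the real disk device

mbr = bytearray(512)
mbr[510] = 0x55  # boot signature (little-endian word 0xAA55)
mbr[511] = 0xAA

with open(IMG, "wb") as f:
    f.write(mbr)

# Probe the way an imaging tool might: read sector 0 and check
# for the signature before trusting the partition entries.
with open(IMG, "rb") as f:
    sector0 = f.read(512)
print(sector0[510:512] == b"\x55\xaa")  # → True
```

An entirely blank drive fails this check (sector 0 is all zeros), which would match the symptom described in comment 1.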
Comment 6•13 years ago
(updates are in the spreadsheet now)
Reporter
Comment 7•13 years ago
w32-ix-slave41 also needs to go back to iX. It errors with "These memory DIMMs are not supported" and then gets into a reboot loop if you hit F1 to continue.
Reporter
Updated•13 years ago
Assignee: server-ops-releng → zandr
Comment 8•13 years ago
Here's the latest from iX on these machines:

Asset # 4620 / A1-16072 - This system is currently under further diagnosis at the moment.
4625 / A1-16077 - This system had a bad board, and we are now processing the RMA for a replacement.
4764 / A1-16163 - This system is under further diagnosis as well; however, we did discover a failed disk. The replacement has been made (with WD) and testing has resumed.
4799 / A1-16198 - This system had a failed disk as well, and the drive has been replaced (with WD).
4810 / A1-16209 - We are currently diagnosing this box but are looking at one of the new modules as the culprit. Regardless, we will be confirming so by tomorrow, I believe.
4794 / A1-16193 - This one was off the list, but it appears the drive may be the culprit. Should have confirmation on this by tomorrow as well.
4617 / A1-16069 - This is the repeat offender we've had here 3 times now. We're certain at this point the board is the culprit and are waiting on the replacement to arrive. ETA is currently 7/26.

We should get them back onsite early next week.
Assignee
Comment 9•13 years ago
(In reply to comment #7)
> w32-ix-slave41 also needs to go back to iX. It errors with "These memory
> DIMMs are not supported" and then gets into a reboot loop if you hit F1 to
> continue.

Reseated the memory, and it seems to come up just fine.
Reporter
Comment 10•13 years ago
w32-ix-slave41 still seems to be in a reboot loop. Please send it back to iX.
Assignee
Comment 11•13 years ago
4620, 4625, 4764, 4799, 4810, 4794, and 4617 (the machines from comment 8) are racked and powered.
Reporter
Comment 12•13 years ago
4625 (linux-ix-slave06) is not responding on its primary or IPMI interface.
Reporter
Comment 13•13 years ago
4810 (linux64-ix-slave37) is also unresponsive on both interfaces.
Reporter
Comment 14•13 years ago
linux64-ix-slave26 looks like it probably needs a partition table written to it before I can image it.
Reporter
Comment 15•13 years ago
The following hosts are now reimaged and ready for postimage/puppetization:

linux-ix-slave01-mgmt
linux-ix-slave06-mgmt
linux-ix-slave33-mgmt
linux64-ix-slave21-mgmt
linux64-ix-slave26-mgmt
linux64-ix-slave37-mgmt
w32-ix-slave23-mgmt
Assignee
Comment 16•13 years ago
Beyond the reboot loop, w32-ix-slave41 is also complaining of incompatible DIMMs. Will pull it for repair.
Comment 17•13 years ago
(In reply to Amy Rich [:arich] from comment #15)
> The following hosts are now reimaged and ready for postimage/puppetization:
>
> linux-ix-slave01-mgmt
> linux-ix-slave06-mgmt
> linux-ix-slave33-mgmt
> linux64-ix-slave21-mgmt
> linux64-ix-slave26-mgmt
> linux64-ix-slave37-mgmt
> w32-ix-slave23-mgmt

Per IRC w/ catlee: these have been done as part of bug#673436.
Assignee
Comment 18•13 years ago
w32-ix-slave41 (4705) pulled and will deliver to iX.
Assignee
Comment 19•13 years ago
So, I had some time to kill waiting for a mini to image and played with 4705 a bit. Flipping to IDE fixed the reboot loop, and I wasn't able to reproduce the memory problem. So, w32-ix-slave41 is back in service, awaiting postimage.
Comment 20•13 years ago
I think this got skipped in yesterday's meeting with IT; neither bear nor I have notes on this. What is the status?
Reporter
Comment 21•13 years ago
Matt mentioned that there are one or two hosts that need to go back to iX.
Comment 22•13 years ago
(In reply to Amy Rich [:arich] from comment #21)
> Matt mentioned that there are one or two hosts that need to go back to iX.

Is this bug 673972?
Reporter
Comment 23•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #22)
> (In reply to Amy Rich [:arich] from comment #21)
> > Matt mentioned that there are one or two hosts that need to go back to iX.
>
> Is this bug 673972?

I believe so, yes.
Reporter
Comment 24•13 years ago
The few machines that are out for repair at iX are now being tracked in bug 673972, so this bug is redundant at this point. Resolving this one in favor of the new bug.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations