Closed Bug 672973 Opened 13 years ago Closed 13 years ago

iX hardware issues in scl1 post heatsink/fan/RAM modifications

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: zandr)

Details

(Whiteboard: last few tracked in bug 673972)

The following iX machines in scl1 still exhibit issues after the heatsink/fan/RAM modifications and should be returned to iX for diagnosis and repair.

linux-ix-slave01
linux-ix-slave06
linux-ix-slave33
linux64-ix-slave26
linux64-ix-slave37
w32-ix-slave41

w32-ix-slave23 is already at iX for hardware issues.

The w64-ix machines have not been tested since we are waiting on the windows builder vlan configuration to be complete before imaging those machines.

The following w32 hosts have also not been tested, since it has not been possible to break the boot cycle to get to the bios to change the boot order so that network is first (these will likely require crash carting):

w32-ix-slave01
w32-ix-slave03
w32-ix-slave26
w32-ix-slave29

The included spreadsheet URL will be used to track status.
colo-trip: --- → scl1
(In reply to comment #0)

> linux-ix-slave01    
> linux-ix-slave06    
> linux-ix-slave33    
> linux64-ix-slave26  
> linux64-ix-slave37       

These machines have been pulled, stacked on the cart, and iX notified to pick them up.

> w32-ix-slave41 
I think this might be an entirely blank drive. If there isn't even a partition table, DeployStudio won't see it. I'll investigate this tomorrow.

> w32-ix-slave01
> w32-ix-slave03
> w32-ix-slave26
> w32-ix-slave29

These four machines have been set to netboot in BIOS and I've kicked off reimaging them with the win32-ix-ref-110527 image.

Spreadsheet updated.
Please add linux64-ix-slave21 to the list of machines to go to iX - it is hung with a machine check exception (4 bank 5 and lots of 0's)
Please add linux64-ix-slave36 to the list of machines to check for an initial partition table.  It's also rebooting into ds without finding a disk.
w32-ix-slave41: wrote partition table, started imaging from w32-ix-ref-110527

linux64-ix-slave36: wrote partition table, started imaging from linux64-ix-ref-110527

linux64-ix-slave21: pulled and delivered to iX.

I haven't put these updates in the spreadsheet yet.
(updates are in the spreadsheet now)
w32-ix-slave41 also needs to go back to iX.  It errors with "These memory DIMMs are not supported" and then gets into a reboot loop if you hit F1 to continue.
Assignee: server-ops-releng → zandr
Here's the latest from iX on these machines:

Asset # 4620 / A1-16072 - This system is currently under further
diagnosis at the moment.

4625 / A1-16077 - This system had a bad board and we are now processing
the RMA for replacement.

4764 / A1-16163 - This system is under further diagnosis as well
however, we did discover a failed disk. The replacement has been made
(with WD) and testing has resumed.

4799 / A1-16198 - This system had a failed disk as well and the drive
has been replaced (with WD).

4810 / A1-16209 - We are currently diagnosing this box but are looking
at one of the new modules as the culprit. Regardless, we will be
confirming so by tomorrow I believe.

4794 / A1-16193 - This one was off the list but it appears the drive may
be the culprit. Should have confirmation on this by tomorrow as well.

4617 / A1-16069 - This is the repeat offender we've had here 3 times
now. We're certain at this point the board is the culprit and are
waiting on the replacement to arrive. ETA is currently 7/26


We should get them back onsite early next week.
(In reply to comment #7)
> w32-ix-slave41 also needs to go back to iX.  It errors with "These memory
> DIMMs are not supported" and then gets into a reboot loop if you hit F1 to
> continue.

Reseated memory and it seems to come up just fine.
w32-ix-slave41 still seems to be in a reboot loop.  Please send it back to iX.
4620
4625
4764
4799
4810
4794
4617
(the machines from comment 8) are racked and powered
4625 linux-ix-slave06 is not responding on its primary or IPMI interface.
4810 linux64-ix-slave37 is also unresponsive on both interfaces
linux64-ix-slave26 looks like it probably needs a partition table put on it before I can image it.
The following hosts are now reimaged and ready for postimage/puppetization:

linux-ix-slave01-mgmt
linux-ix-slave06-mgmt
linux-ix-slave33-mgmt
linux64-ix-slave21-mgmt
linux64-ix-slave26-mgmt
linux64-ix-slave37-mgmt
w32-ix-slave23-mgmt
Beyond the reboot loop, w32-ix-slave41 is also complaining of incompatible DIMMs. Will pull for repair.
(In reply to Amy Rich [:arich] from comment #15)
> The following hosts are now reimaged and ready for postimage/puppetization:
> 
> linux-ix-slave01-mgmt
> linux-ix-slave06-mgmt
> linux-ix-slave33-mgmt
> linux64-ix-slave21-mgmt
> linux64-ix-slave26-mgmt
> linux64-ix-slave37-mgmt
> w32-ix-slave23-mgmt

Per IRC with catlee: these have been done as part of bug 673436.
w32-ix-slave41 (4705) pulled and will deliver to iX.
So, I had some time to kill waiting for a mini to image and played with 4705 a bit. Flipping to IDE fixed the reboot loop, and I wasn't able to reproduce the memory problem.

So, w32-ix-slave41 is back in service, awaiting postimage.
I think this got skipped in yesterday's meeting with IT; neither bear nor I have notes on this. What is the status?
Matt mentioned that there are one or two hosts that need to go back to iX.
(In reply to Amy Rich [:arich] from comment #21)
> Matt mentioned that there are one or two hosts that need to go back to iX.

Is this bug 673972?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #22)
> (In reply to Amy Rich [:arich] from comment #21)
> > Matt mentioned that there are one or two hosts that need to go back to iX.
> 
> Is this bug 673972?

I believe so, yes.
Depends on: 673972
Whiteboard: last few tracked in bug 673972
The few machines that are out for repair at iX are now being tracked in bug 673972, so this bug is redundant at this point. Resolving this one in favor of the new bug.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations