Closed Bug 959663 Opened 12 years ago Closed 12 years ago

Troubleshoot t-w732-ix-115

Categories

(Infrastructure & Operations :: DCOps, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

(Adding q and markco for this issue).

We have a Win7 machine that sometimes crashes and reboots when it runs xperf unit tests. After running it on staging I eventually managed to hit the issue. This could have been happening as far back as July ("Disconnecting a lot, dunno what's happening yet.").

Could you please determine if new cabling is needed? Or something else?

I saw this Kernel-Power issue right after a failed xperf run [1].
I also saw a "The device, \Device\Ide\iaStor0, did not respond within the timeout period." [2] before the unexpected reboot.

I see a SCSI cabling note in here:
http://social.technet.microsoft.com/Forums/windows/en-US/3356c10e-673f-4882-9e5d-b9d61bafce9f/the-device-deviceideiastor0-did-not-respond-within-the-timeout-period?forum=w7itprogeneral
There's even more info in here:
http://support.microsoft.com/kb/154690

The log ended here:
13:05:02 INFO - JavaScript error: http://localhost/page_load_test/tp5n/etsy.com/www.etsy.com/category/geekery/videogame.html, line 326: Etsy is not defined
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

[1]
Log Name: System
Source: Microsoft-Windows-Kernel-Power
Date: 1/13/2014 1:07:18 PM
Event ID: 41
Task Category: (63)
Level: Critical
Keywords: (2)
User: SYSTEM
Computer: T-W732-IX-115.releng.ad.mozilla.com
Description: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{331C3B3A-2005-44C2-AC5E-77220C37D6B4}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2014-01-13T21:07:18.073239600Z" />
    <EventRecordID>35508</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>T-W732-IX-115.releng.ad.mozilla.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>

[2]
Log Name: System
Source: iaStor
Date: 1/13/2014 1:04:27 PM
Event ID: 9
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: T-W732-IX-115.releng.ad.mozilla.com
Description: The device, \Device\Ide\iaStor0, did not respond within the timeout period.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="iaStor" />
    <EventID Qualifiers="49156">9</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-01-13T21:04:27.134935100Z" />
    <EventRecordID>35500</EventRecordID>
    <Channel>System</Channel>
    <Computer>T-W732-IX-115.releng.ad.mozilla.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Ide\iaStor0</Data>
    <Binary>0F0028000100000000000000090004C011111111090004C0000000000200000067452301EFCDAB89010000000000CCCC0000B0CD00000000010000001E000000100000003C5177D71000000070F00FDF</Binary>
  </EventData>
</Event>
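
For anyone comparing these two events by hand, the following is a minimal sketch (not something that was run as part of this bug) of pulling the provider, event ID, timestamp, and data fields out of exported event XML with Python's standard library. The helper name `summarize_event` and the two constants are hypothetical, and the embedded XML is a trimmed copy of the events quoted above, with most fields omitted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace declared on the <Event> elements quoted in this bug.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copies of the two quoted events (illustrative; most fields omitted).
KERNEL_POWER_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" />
    <EventID>41</EventID>
    <TimeCreated SystemTime="2014-01-13T21:07:18.073239600Z" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="SleepInProgress">false</Data>
  </EventData>
</Event>"""

IASTOR_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="iaStor" />
    <EventID Qualifiers="49156">9</EventID>
    <TimeCreated SystemTime="2014-01-13T21:04:27.134935100Z" />
  </System>
  <EventData>
    <Data>\\Device\\Ide\\iaStor0</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Hypothetical helper: extract a few fields from one exported event."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    raw_time = system.find("e:TimeCreated", NS).get("SystemTime")
    # The timestamp has 7 fractional digits, more than strptime's %f accepts,
    # so drop the trailing "Z" and the fraction and keep whole seconds (UTC).
    whole_seconds = raw_time.rstrip("Z").split(".")[0]
    when = datetime.strptime(whole_seconds, "%Y-%m-%dT%H:%M:%S")
    data_nodes = list(root.iterfind("e:EventData/e:Data", NS))
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "time": when.replace(tzinfo=timezone.utc),
        # Named <Data> entries become a dict; unnamed ones are kept in order.
        "data": {d.get("Name"): d.text for d in data_nodes if d.get("Name")},
        "unnamed": [d.text for d in data_nodes if not d.get("Name")],
    }


power = summarize_event(KERNEL_POWER_XML)
stor = summarize_event(IASTOR_XML)
gap_seconds = (power["time"] - stor["time"]).total_seconds()

print(stor["provider"], stor["event_id"], stor["unnamed"])
print(power["provider"], power["event_id"], power["data"]["BugcheckCode"])
print("seconds between storage timeout and reboot record:", gap_seconds)
```

Read this way, the iaStor timeout precedes the Kernel-Power record by a few minutes, and the Kernel-Power event carries a BugcheckCode of 0. As far as I understand, a zero bugcheck code means no stop error was recorded, which would fit a hard hang or power loss rather than a blue screen, but treat that reading as an assumption.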
FYI, I scheduled a disk check on start-up by mistake. It will need hands-on intervention if I understand correctly (unless it manages to reach the log-in page after the disk scan).
colo-trip: --- → scl3
:armen, these hosts have front-loading drives and are 4 blade servers to a chassis. there are no SCSI cables and SATA doesn't require active termination. i would need to open up the chassis (and bring down the other 3 hosts) to see if there are any SATA cables being used at all.

the drive diagnostics came back negative, so i've reseated the blade server in its chassis and the drive in its bay. please run a few more tests and reopen this bug if issues persist. we can send this to iX and have them do a 48-hour burn-in to see if there are any issues they can detect.

host is back online.

[vle@admin1a.private.scl3 ~]$ fping t-w732-ix-115.wintest.releng.scl3.mozilla.com
t-w732-ix-115.wintest.releng.scl3.mozilla.com is alive
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Thanks for looking into this! I will request a burn-in if it persists.
Filed bug 961844.
Product: mozilla.org → Infrastructure & Operations