Closed Bug 890035 (t-w732-ix-115) Opened 11 years ago Closed 7 years ago

t-w732-ix-115 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [buildduty][xperf_q3_disabled] status-in-comment-16)

Disconnecting a lot, dunno what's happening yet. Disabled in slavealloc for now.
Depends on: 890274
Trying a reimage.
Reimage done, back in production
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Showing the same reftest failures that were narrowed down to a slave issue on Friday. Given that this was just reimaged, I wonder if it's some sort of configuration issue.

https://tbpl.mozilla.org/php/getParsedLog.php?id=25262068&tree=Mozilla-Aurora

I'll also point out that the slave health page for this slave looks pretty awful overall.
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=t-w732-ix-115

Disabled in slavealloc.
RyanVM, can you please REOPEN when disabling it? Thanks!

It seems that it did make some WebGL tests fail.
Look at attachment 777760 in bug 873566 to see that the color depth was messed up.
Blocks: 873566
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Product: mozilla.org → Release Engineering
Depends on: 913030
Hardware diagnostics passed. Back in production after a reimage, let's hope that helped.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Can't run WebGL tests. Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
No longer blocks: 873566
Depends on: 873566
Rebooted into production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Zero successful xperf runs since July 14th (e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=27984306&tree=Fx-Team).

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty] → [buildduty][xperf_q3_disabled]
Putting into staging.
Assignee: nobody → armenzg
Depends on: 945451
Re-imaging.
It is having issues with xperf.
philor, does this ring a bell?

=1=
18:41:33     INFO -  JavaScript warning: http://mochi.test:8888/tests/dom/imptests/html/webgl/common.js, line 3: WebGL: Error during ANGLE OpenGL ES initialization

=2=
http://cl.ly/SoQN

=3=
I don't see anything in here. Maybe the wallpaper?
http://cl.ly/SoQV

=4=
17:16:36     INFO -  JavaScript warning: http://mochi.test:8888/tests/content/canvas/test/webgl/non-conf-tests/webgl-util.js, line 49: WebGL: Error during ANGLE OpenGL ES initialization
Never mind.
I think it is a graphical issue:
http://cl.ly/Sofw
Depends on: 946759
I will file an IT bug to chase this issue (which does not always happen).

I saw this Kernel-Power issue right after a failed xperf run [1]

I also saw a "The device, \Device\Ide\iaStor0, did not respond within the timeout period." [2] before the unexpected reboot.

I see a SCSI cabling note in here:
http://social.technet.microsoft.com/Forums/windows/en-US/3356c10e-673f-4882-9e5d-b9d61bafce9f/the-device-deviceideiastor0-did-not-respond-within-the-timeout-period?forum=w7itprogeneral
There's even more info in here:
http://support.microsoft.com/kb/154690

The log ended in here:
13:05:02     INFO -  JavaScript error: http://localhost/page_load_test/tp5n/etsy.com/www.etsy.com/category/geekery/videogame.html, line 326: Etsy is not defined
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

[1]
Log Name:      System
Source:        Microsoft-Windows-Kernel-Power
Date:          1/13/2014 1:07:18 PM
Event ID:      41
Task Category: (63)
Level:         Critical
Keywords:      (2)
User:          SYSTEM
Computer:      T-W732-IX-115.releng.ad.mozilla.com
Description:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{331C3B3A-2005-44C2-AC5E-77220C37D6B4}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2014-01-13T21:07:18.073239600Z" />
    <EventRecordID>35508</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>T-W732-IX-115.releng.ad.mozilla.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>

[2] 
Log Name:      System
Source:        iaStor
Date:          1/13/2014 1:04:27 PM
Event ID:      9
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      T-W732-IX-115.releng.ad.mozilla.com
Description:
The device, \Device\Ide\iaStor0, did not respond within the timeout period.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="iaStor" />
    <EventID Qualifiers="49156">9</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-01-13T21:04:27.134935100Z" />
    <EventRecordID>35500</EventRecordID>
    <Channel>System</Channel>
    <Computer>T-W732-IX-115.releng.ad.mozilla.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Ide\iaStor0</Data>
    <Binary>0F0028000100000000000000090004C011111111090004C0000000000200000067452301EFCDAB89010000000000CCCC0000B0CD00000000010000001E000000100000003C5177D71000000070F00FDF</Binary>
  </EventData>
</Event>
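
For anyone comparing the two events above, here is a rough Python sketch (not part of the original diagnostics) that pulls the provider, event ID and timestamp out of trimmed copies of the Event XML and measures the gap between the iaStor timeout and the Kernel-Power reboot record. The helper names `parse_system_time` and `summarize` are made up for illustration, and the XML is cut down to the fields used here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by the Windows Event XML pasted above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copies of events [1] and [2]; only the fields read below are kept.
KERNEL_POWER_XML = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" />
    <EventID>41</EventID>
    <TimeCreated SystemTime="2014-01-13T21:07:18.073239600Z" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
  </EventData>
</Event>"""

IASTOR_XML = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="iaStor" />
    <EventID Qualifiers="49156">9</EventID>
    <TimeCreated SystemTime="2014-01-13T21:04:27.134935100Z" />
  </System>
  <EventData>
    <Data>\Device\Ide\iaStor0</Data>
  </EventData>
</Event>"""


def parse_system_time(value):
    """Parse a SystemTime attribute, truncating the fraction to microseconds."""
    base, _, frac = value.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]  # the log has 9 fractional digits, %f takes at most 6
    parsed = datetime.strptime(f"{base}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def summarize(event_xml):
    """Return provider, event ID, UTC timestamp and EventData of one event."""
    root = ET.fromstring(event_xml)
    system = root.find("e:System", NS)
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "time": parse_system_time(system.find("e:TimeCreated", NS).get("SystemTime")),
        # Unnamed <Data> elements (as in the iaStor event) end up under the key None.
        "data": {d.get("Name"): d.text for d in root.iterfind("e:EventData/e:Data", NS)},
    }


timeout = summarize(IASTOR_XML)
reboot = summarize(KERNEL_POWER_XML)
gap = reboot["time"] - timeout["time"]
print(f"{timeout['provider']} event {timeout['event_id']} came {gap} before "
      f"{reboot['provider']} event {reboot['event_id']}")
```

With the timestamps quoted above, the iaStor timeout is logged just under three minutes before the Kernel-Power 41 record, and BugcheckCode is 0, which as far as I know means no stop code was recorded for the reboot.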
Depends on: 959663
Assignee: armenzg → nobody
Armen, this machine is still locked to your dev master - what's up with it?
Flags: needinfo?(armenzg)
I was hoping DCOps would be able to fix the machine so that it could be put back into production.

It seems I will have to run more tests and request servicing by iX if it happens again.

Grabbing it.
Assignee: nobody → armenzg
Flags: needinfo?(armenzg)
Summary
#######
tl;dr: requested burn-in from iX

2013-07-03 - disconnecting a lot; re-image requested
2013-07-15 - reftest failures & lots of debugging in bug 873566 with other machines
2013-09-05 - request for diagnostics; nothing found and re-imaged
2013-09-06 - it can't run webgl tests
2013-09-12 - rebooted into production after comment 42 in bug 873566 [1]
2013-09-17 - first mention of xperf failures
2013-11-01 - put into staging
2013-12-02 - re-imaging requested
2013-12-05 - graphical setup might be related
2014-01-14 - diagnostics requested; kernel-power issues spotted - may be cabling issues
2014-01-20 - file request for burn-in


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=873566#c42
Depends on: 961844
Whiteboard: [buildduty][xperf_q3_disabled] → [buildduty][xperf_q3_disabled] status-in-comment-16
Back in production. One orange job and one green so far.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
This is the one we should keep an eye on to see how it fares with xperf jobs.
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Status: RESOLVED → REOPENED
Depends on: 1314228
Resolution: FIXED → ---
No longer depends on: 1314853
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1314976)
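
The escalation order in the lines above (SSH first, then IPMI, then an IT bug) could be sketched roughly as follows. This is only an illustration and not the actual tooling that produced those messages: `reboot_with_escalation`, the stub callables and the "Succeeded" wording are all hypothetical.

```python
def reboot_with_escalation(host, methods, file_bug):
    """Try each reboot method in order; file a bug only if all of them fail.

    `methods` is a list of (name, callable) pairs, where each callable takes
    the host name and returns True on success. `file_bug` takes the host name
    and returns a message describing the bug that was filed.
    """
    log = []
    for name, attempt in methods:
        try:
            ok = bool(attempt(host))
        except Exception:
            # Treat any error from a reboot attempt as a failed attempt.
            ok = False
        log.append(f"Attempting {name} reboot...{'Succeeded' if ok else 'Failed'}.")
        if ok:
            return log
    log.append(file_bug(host))
    return log


# Hypothetical stubs standing in for real SSH/IPMI/bug-filing code.
def always_fail(host):
    return False


def fake_file_bug(host):
    return f"Filed IT bug for reboot (host {host})"


log = reboot_with_escalation(
    "t-w732-ix-115",
    [("SSH", always_fail), ("IPMI", always_fail)],
    fake_file_bug,
)
print("\n".join(log))
```

Passing the methods in as callables keeps the order explicit and makes it easy to skip a step, as in the later comment where only the SSH attempt appears before the IT bug.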
No longer depends on: 1315210
Back online and taking jobs.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed.
Filed IT bug for reboot (bug 1373696)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard