Closed Bug 975006 Opened 10 years ago Closed 10 years ago

please investigate t-xp32-ix-085

Categories

(Infrastructure & Operations :: DCOps, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: waiting for ack)

Attachments

(1 file)

It was recently re-imaged but is having trouble with tests that require hardware acceleration:
13:31:42     INFO -  1919 ERROR TEST-UNEXPECTED-FAIL | /tests/gfx/tests/mochitest/test_acceleration.html | Acceleration enabled on Windows XP or newer - didn't expect 0, but got it
running diags
colo-trip: --- → scl3
Whiteboard: hardware diagnostics
passed diags. reimaging
Whiteboard: hardware diagnostics → reimaging
host is up.

sals-MacBook-Pro-3:~ sal$ sudo fping 10.26.18.61
10.26.18.61 is alive
sals-MacBook-Pro-3:~ sal$ sudo fping 10.26.41.105
10.26.41.105 is alive
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
This machine failed jobs immediately when it came back. The failure mode suggests that the issue may be with the graphics card (nvidia). Do we have tools for diagnosing graphics hardware failure? If so, we should run them, or come up with a protocol if not.

Please verify the graphics card on this machine and report back.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: reimaging
I've replaced the video card and reimaged the host.  Give it a try and if it resolves the issue, please close out the ticket.
Whiteboard: waiting for ack
back in production. I'll close this after the first job completes green. Thank you :)
so I don't know about the previous logs but this is still burning:
https://tbpl.mozilla.org/php/getParsedLog.php?id=35614414&tree=Mozilla-Inbound

I can grep the 'Acceleration enabled on Windows XP or newer' error in comment one but not sure about gfx card issues.

as per problem tracking bug, this has been disabled. What's the next path of escalation for this? Decommission?

Note: t-xp32-ix-008 (similar re-image/vid replacement: 975104) has greened one test but it was talos and seems these are failing unittests.

Leaving open
The hosts are under warranty for 2 more years; we certainly wouldn't decommission them.
we can try to swap this blade into another chassis, this might help us pin point if it's a chassis/slot issue. let us know if you can take down a host or we can wait until we receive another reimage request. 

:sal, do you recall if this was one of the blades that went back to iX for the 48 hour burn-in test?
(In reply to Van Le [:van] from comment #9)
> we can try to swap this blade into another chassis, this might help us pin
> point if it's a chassis/slot issue. let us know if you can take down a host
> or we can wait until we receive another reimage request. 

this host has been disabled out of production. IIUC, we can shut it down anytime if you want to try a new slot for it.

add a needinfo to me for this bug and I can shut down quickly for you.
Hi van. I disabled t-w732-ix-002 (bug 890312) for you. It has been running pretty well for the last few months. I left it running but it is no longer in production. Feel free to shut down and swap slots when time allows.

Good luck in troubleshooting!

Let me or anyone with 'buildduty' in their nickname know in #buildduty when you are ready for us to enable t-w732-ix-002 back in production or when you wish to swap the other troublesome machine (008) with a different healthy machine.
swapped t-w732-ix-002 and t-xp32-ix-085's location with one another. both hosts are currently reimaging, should be completed in ~1hr.

recap:

t-w732-ix-002 is now in 401-3 - 35.10 and reimaged with w7.
t-xp32-ix-085 is now in 401-2 - 28.02 and reimaged with xp32.
:jlund, did you get a chance to put these hosts into production to see if any jobs were burned?
(In reply to Van Le [:van] from comment #13)
> :jlund, did you get a chance to put these hosts into production to see if
> any jobs were burned?

hey, only catching this now.

the swap happened at the end of my buildduty rotation. looks like it hasn't been hit in our queue yet. I am on buildduty tomorrow and will address it before other items so you can get results. I'll post findings tomorrow once it takes some jobs.

sorry for the delay.
these are back in production
both machines passed their first job.

although t-xp32-ix-085 did a jetpack suite and it seems like mochitest is what it really has trouble with.

stay tuned.
van: looks like it did not work. please see https://bugzilla.mozilla.org/show_bug.cgi?id=938872#c13

have a great weekend
jlund: were there any error logs that could help us pin point the issue? this host has passed the iX 48 hour burn in test and the only diagnostics we have are for memory and hard drive which it has also all passed. the only thing i could think of is swapping out the video card and trying again since we dont have any specific tests for that.

we need to go back to iX with hard/presentable errors or issues for them to replace the hardware.
(In reply to Van Le [:van] from comment #18)
> jlund: were there any error logs that could help us pin point the issue?

I have put up public logs of some mochitest jobs that failed on this slave:
http://people.mozilla.org/~jlund/t-xp32-ix-085_fail_log-mochi2-2.txt

if you grep for: 'WARNING -  One or more unittests failed' you can see that 8 tests failed. Here is a snippet:
:23:17     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(gIMM32Handler) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/widget/windows/nsIMM32Handler.cpp, line 254
03:23:17     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(gIMM32Handler) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/widget/windows/nsIMM32Handler.cpp, line 254
03:23:17     INFO -  7919 INFO TEST-PASS | /tests/dom/inputmethod/mochitest/test_delete_focused_element.html | input was blurred.
03:23:17     INFO -  7920 INFO TEST-PASS | /tests/dom/inputmethod/mochitest/test_delete_focused_element.html | textarea was focused.
03:23:17     INFO -  7921 INFO TEST-PASS | /tests/dom/inputmethod/mochitest/test_delete_focused_element.html | textarea was removed.
03:23:17     INFO -  7922 INFO TEST-INFO | MEMORY STAT vsize after test: 931352576
03:23:17     INFO -  7923 INFO TEST-INFO | MEMORY STAT vsizeMaxContiguous after test: 238313472
03:23:17     INFO -  7924 INFO TEST-INFO | MEMORY STAT residentFast after test: 266027008
03:23:17     INFO -  7925 INFO TEST-END | /tests/dom/inputmethod/mochitest/test_delete_focused_element.html | finished in 752ms
03:23:17     INFO -  ++DOMWINDOW == 75 (1D626610) [pid = 1132] [serial = 2395] [outer = 0EE02428]
03:23:18     INFO -  7926 INFO TEST-START | /tests/dom/inputmethod/mochitest/test_sendkey_cancel.html
03:23:18     INFO -  ++DOMWINDOW == 76 (1DDAAC68) [pid = 1132] [serial = 2396] [outer = 0EE02428]
03:23:18     INFO -  ++DOCSHELL 0E7AD208 == 15 [pid = 1132] [id = 344]
03:23:18     INFO -  ++DOMWINDOW == 77 (18EF7480) [pid = 1132] [serial = 2397] [outer = 00000000]
03:23:18     INFO -  ###################################### forms.js loaded
03:23:18     INFO -  ############################### browserElementPanning.js loaded
03:23:18     INFO -  [Parent 1132] WARNING: Subdocument container has no frame: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/layout/base/nsDocumentViewer.cpp, line 2419
03:23:18     INFO -  ++DOMWINDOW == 78 (16B18C18) [pid = 1132] [serial = 2398] [outer = 18EF7480]
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(mMutable) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/netwerk/base/src/nsSimpleURI.cpp, line 265
03:23:18     INFO -  ######################## BrowserElementChildPreload.js loaded
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(mCallback) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/content/base/src/nsFrameMessageManager.cpp, line 640
03:23:18     INFO -  ++DOMWINDOW == 79 (2C5229C0) [pid = 1132] [serial = 2399] [outer = 18EF7480]
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(aSelection->GetRangeCount()) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/editor/libeditor/base/nsEditor.cpp, line 3778
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/editor/libeditor/base/nsEditor.cpp, line 3757
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_SUCCESS(res, res) failed with result 0x80004005: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/editor/libeditor/text/nsTextEditRules.cpp, line 418
03:23:18     INFO -  [Parent 1132] WARNING: NS_ENSURE_TRUE(gIMM32Handler) failed: file c:/builds/moz2_slave/m-aurora-w32-d-000000000000000/build/widget/windows/nsIMM32Handler.cpp, line 254
03:23:19     INFO -  JavaScript error: chrome://browser/content/tabbrowser.xml, line 3164: tab is null
03:23:19     INFO -  7927 INFO TEST-PASS | /tests/dom/inputmethod/mochitest/test_sendkey_cancel.html | inputcontextchange event was fired.
03:23:19     INFO -  7928 INFO TEST-PASS | /tests/dom/inputmethod/mochitest/test_sendkey_cancel.html | sendKey was rejected
03:23:19     INFO -  7929 INFO TEST-INFO | MEMORY STAT vsize after test: 931418112
03:23:19     INFO -  7930 INFO TEST-INFO | MEMORY STAT vsizeMaxContiguous after test: 238313472
03:23:19     INFO -  7931 INFO TEST-INFO | MEMORY STAT residentFast after test: 265973760
03:23:19     INFO -  7932 INFO TEST-END | /tests/dom/inputmethod/mochitest/test_sendkey_cancel.html | finished in 986ms
03:23:19     INFO -  ++DOMWINDOW == 80 (172D61B0) [pid = 1132] [serial = 2400] [outer = 0EE02428]
03:23:19     INFO -  7933 INFO TEST-START | Shutdown
03:23:19     INFO -  7934 INFO Passed:  226087
03:23:19  WARNING -  7935 INFO Failed:  8
03:23:19  WARNING -  One or more unittests failed.

the above snippet gives an example of 'WARNING: NS_ENSURE_SUCCESS' and 'failed: file' lines. There are many more in the full log. As contrast, here is a log that passed all its mochitests:
https://tbpl.mozilla.org/php/getParsedLog.php?id=37379013&full=1&branch=mozilla-central

I am not sure what hardware reasons would be causing these to fail. I see these test logs mention RAM/memory/vsize but I am not sure if RAM is the issue. You said yourself that you have tested that already.

There are three other logs I have available that all show similar errors. Does this help?
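
A minimal sketch of pulling the pass/fail totals out of a log like the one above, assuming the summary lines keep the "INFO Passed:" / "INFO Failed:" shape shown in the snippet. The name summary_counts is hypothetical and not part of any Mozilla tooling.

```python
import re

# Matches summary lines such as "03:23:19  WARNING -  7935 INFO Failed:  8".
SUMMARY_RE = re.compile(r"INFO (Passed|Failed):\s+(\d+)")


def summary_counts(log_text):
    """Return a dict such as {'Passed': N, 'Failed': M} from a mochitest log."""
    counts = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Sample lines copied from the snippet in this comment.
sample = """03:23:19     INFO -  7934 INFO Passed:  226087
03:23:19  WARNING -  7935 INFO Failed:  8
03:23:19  WARNING -  One or more unittests failed."""

print(summary_counts(sample))
```

The totals only say how many tests failed, not which ones, so they are a quick triage signal rather than a diagnosis.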
t-xp32-ix-085 is already disabled and is ready to be swapped once it is shutdown.

t-w732-ix-002: I just disabled this slave. please give it 1 hour to finish any test it was running before shutting down/swapping.

thanks!
t-w732-ix-002 -> moved back to original location and i've started the reimage process.

t-xp32-ix-085 -> i honestly can't tell what the problem is from those logs, but since they do mention memory, i've replaced the two field replaceable units (FRUs) that contain memory: the video card and the memory DIMMs. i've kicked off the reimage process; please run some tests on it after completion and let me know if this resolves the issue. by swapping chassis, we can at least confirm that it is not a slot issue in the chassis and not a hard drive issue (since the hard drive doesn't move with the node, hence the reimages).
Sorry, didn't realize I wasn't cc'ed on this bug. The thing you want to look for in logs is "UNEXPECTED-FAIL", which will lead you in that log to various WebGL tests failing saying "should be able to get a context," which is their way of saying "WebGL ought to work on something running on this OS, why have you taunted me by running me on something with tiny resolution and no graphics acceleration?" The warnings and memory stuff are all just logspam, every failure that we've had with this slave going back to February when it came back from iX has been because it was running at a tiny resolution without hardware graphics acceleration.
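
A rough sketch of the filtering described above: keep only the "TEST-UNEXPECTED-FAIL" lines and skip the warnings and memory stats. It assumes the " | "-separated line layout seen in the logs quoted in this bug. The name unexpected_failures is hypothetical.

```python
def unexpected_failures(log_text):
    """Return (test_path, message) pairs for every TEST-UNEXPECTED-FAIL line."""
    failures = []
    for line in log_text.splitlines():
        if "TEST-UNEXPECTED-FAIL" not in line:
            continue  # WARNING and MEMORY STAT lines are skipped as logspam
        # Fields are separated by " | ": status, test path, message.
        parts = [part.strip() for part in line.split(" | ", 2)]
        if len(parts) == 3:
            failures.append((parts[1], parts[2]))
    return failures


# Both sample lines are copied from logs quoted earlier in this bug.
sample = (
    "13:31:42     INFO -  1919 ERROR TEST-UNEXPECTED-FAIL | "
    "/tests/gfx/tests/mochitest/test_acceleration.html | "
    "Acceleration enabled on Windows XP or newer - didn't expect 0, but got it\n"
    "03:23:17     INFO -  7922 INFO TEST-INFO | MEMORY STAT vsize after test: 931352576\n"
)

for path, message in unexpected_failures(sample):
    print(path, "->", message)
```

Grouping the results by test path should make it easier to see whether the failures cluster in graphics-related tests, which is the kind of presentable evidence asked for earlier in this bug.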
confirmed the video setting is 1600x1200. please run tests and let me know if issues are resolved.
Not showing 1600x1200.
IIUC it is using the wrong graphics card.
I'm unable to bump the resolution any higher than 1280x1024 and graphics card is already set to 3rd party.  Going to reimage and see what's up.
Q or marko - Even after a reimage, highest resolution I can set is 1280x1024.  I tried to disable the onboard video adapter under device manager but looks like you guys set a pw restriction now.  Can one of you guys take a look please?
Flags: needinfo?(q)
Flags: needinfo?(mcornmesser)
I am taking a look at it. It should not need the on-board display disabled.
Flags: needinfo?(q)
I've reimaged the host and set the resolution to 1600x1200.  Originally it was setting the onboard graphics card as the primary monitor within the OS.
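
A hedged sketch of how the resolution could be checked from a script after a reimage instead of by hand. It assumes 1600x1200 is the minimum the tests need (the figure used in this bug) and reads the primary display size through the Win32 GetSystemMetrics call via ctypes. The helper names are hypothetical, and on non-Windows hosts the lookup simply returns None.

```python
import sys

REQUIRED = (1600, 1200)  # resolution mentioned in this bug; treated as a minimum


def meets_required(width, height, required=REQUIRED):
    """True when the reported resolution is at least the required one."""
    return width >= required[0] and height >= required[1]


def primary_resolution():
    """Return (width, height) of the primary display on Windows, else None."""
    if not sys.platform.startswith("win"):
        return None
    import ctypes

    user32 = ctypes.windll.user32
    # SM_CXSCREEN = 0 and SM_CYSCREEN = 1: primary display size in pixels.
    return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)


current = primary_resolution()
if current is not None:
    print(current, "ok" if meets_required(*current) else "too small")
```

This only reports what the OS considers the primary display, so a host that has picked the onboard adapter as primary would show up here as a lower resolution. It does not say which adapter is in use.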
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Flags: needinfo?(mcornmesser)
Thank you for looking into it!
Product: mozilla.org → Infrastructure & Operations