Closed
Bug 938872
(t-xp32-ix-085)
Opened 11 years ago
Closed 10 years ago
t-xp32-ix-085 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
(Whiteboard: [buildduty][buildslaves][capacity])
Attachments
(1 file: image/png, 10.12 KB)
https://tbpl.mozilla.org/php/getParsedLog.php?id=30576711&tree=Mozilla-Inbound is a block of 64 Pink Pixels of Death, a rather impressive feat of memory failure which really ought to show up when this slave has memtest run on it.
Disabled in slavealloc.
Comment 1•11 years ago
Let's run memtest just out of curiosity. We will swap the memory even if memtest shows no sign of a fault.
Comment 2•11 years ago
Attempting SSH reboot...Failed.
Filed IT bug for reboot (bug 963858)
Comment 3•11 years ago
Back in production.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 4•11 years ago
Either broken, or the reimage went badly:
13:31:42 INFO - 1919 ERROR TEST-UNEXPECTED-FAIL | /tests/gfx/tests/mochitest/test_acceleration.html | Acceleration enabled on Windows XP or newer - didn't expect 0, but got it
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 5•11 years ago
Attempting SSH reboot...Failed.
Filed IT bug for reboot (bug 974117)
Comment 6•11 years ago
Gonna try it again after another reimage+diagnostics.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 7•11 years ago
First job was orange, but not for acceleration reasons...so I think it's legit orange. Still watching the next job.
Comment 8•11 years ago
Machine is still broken:
11:42:23 INFO - 7502 ERROR TEST-UNEXPECTED-FAIL | /tests/content/canvas/test/webgl/non-conf-tests/test_webgl_available.html | Expected WebGL creation to succeed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 9•11 years ago
back in production, leaving dep bug 975006 open until first job completes.
Comment 10•11 years ago
Still burning everything it touches. Disabled again.
https://tbpl.mozilla.org/php/getParsedLog.php?id=35614414&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=35615016&tree=Mozilla-Inbound
Comment 11•11 years ago
moved to t-w732-ix-002 slot and brought back into production for testing. will post findings in bug 975006
Comment 12•11 years ago (Reporter)
Still broken, disabled again. I'd really very much rather that you not put known broken slaves in production on a Friday and then just wander off.
Comment 13•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #12)
> Still broken, disabled again. I'd really very much rather that you not put
> known broken slaves in production on a Friday and then just wander off.
to clarify:
- these machines were enabled Fri morning PT
- I reported to sheriffs at that time that I was enabling these hosts and warned they might fail.
- as said in comment 11, I'd post findings in bug 975006 and I did after first job which was green
- checking back on things now, it's obvious that our attempt to fix did not work.
I apologize this caused failures and time on your end. Thank you for disabling.
If you would like to question certain practices of mine as buildduty, I'm open to a constructive talk. Maybe I can then draw on your expertise and change how we do things in a way that works better for sheriffs and our build system, e.g., possibly not enabling machines on a Friday.
van: it seems like switching slots did not do the trick. looks like we will have to try something else :)
Comment 14•11 years ago
van has been working hard on this slave trying to find the root problem.
Pete, can you please enable this on Thursday or Friday? Note that, per the comments above, I failed to catch it burning jobs. Could you please keep a very close eye on it and catch it burning a job before the sheriffs do?
Thanks!
Flags: needinfo?(pmoore)
Comment 15•11 years ago (Reporter)
But note that it's way way harder to spot than a simple "it burns, you'll see that every job is red," since what it actually does is run at far too low a resolution and without graphics acceleration, so when it has test failures, you have to look at the actual failures, and realize that in this case a context-menu test is failing because the resolution is so low that most of the menu is off-screen, not because of the usual intermittent failure bug. Since we have tens of thousands of intermittent failure bugs filed, and since tbpl suggests them based on the test filename, not on the failure message, even if you see that this slave took a job and had a test failure, and when you go to look a sheriff has starred it as being an intermittent failure, that *still* doesn't mean that it was that known failure rather than the busted resolution and lack of acceleration; only knowing to ask yourself "given that this particular slave is probably busted, running with tiny resolution, does this failure still look like a known intermittent rather than it having tiny resolution?" will say whether it's still busted or not.
Well, or just starting it up outside of production and looking at whether it's running at a tiny resolution; it sure seems to me like that ought to be possible somehow.
Comment 16•11 years ago
Thanks folks.
So it sounds like my best bet is to check the resolution before putting it in production.
It looks like a Windows XP 32-bit slave, so I'll see if I can VNC onto it.
Phil: what resolution *should* it have?
If this is just a case of the resolution being incorrectly set, is something wrong with our imaging process? Can this be fixed by GPO config?
Also, is it necessary for hardware acceleration to be enabled? If so, how would we enable it, and can we make that part of the imaging process? If it is not possible on this hardware but is a requirement of the tests, should we disable the tests that require hardware acceleration on machines that do not (and cannot) have it?
Thanks,
Pete
Flags: needinfo?(pmoore) → needinfo?(philringnalda)
Comment 17•11 years ago (Reporter)
According to a screenshot from a test failure on a healthy winxp slave, resolution should be 1600x1200.
And according to my vague memory of things armenzg has said in bugs and screenshots he has posted to bugs while we've had post-reimaging resolution and graphics problems before, you can compare... maybe it's the Graphics Properties, from right-clicking the desktop?... between a busted slave and a healthy one to find that the busted one is using the wrong graphics card, or only has one when it should have two, or has the right one, but it only thinks it's capable of running smaller resolutions. Not sure, I've never had access to any of our slaves and I no longer own anything running WinXP.
Flags: needinfo?(philringnalda)
Comment 18•11 years ago
Thanks Philor.
Armen, see comments 16 and 17 above - what are your thoughts?
Pete
Flags: needinfo?(armenzg)
Comment 19•11 years ago
I found out RelOps are working on adding a test to the start talos bat to check resolution before launching run slave. Hopefully this could help. Not sure whether graphics acceleration is needed. Also RelOps mentioned there could be a physical monitor connected to slave.
RelOps will update this bug to associate it to the start talos check bug they are working on.
Comment 20•11 years ago
I think it may be worthwhile seeing if that change fixes it, rather than doing a one-time fix.
However, a one-time fix is probably as simple as vnc'ing onto the slave and changing desktop size (not sure about hardware acceleration though).
Comment 21•11 years ago
It seems it is still using the wrong display.
I've asked IT to look into it in the dep bug.
I created this page (I will be adding more info):
https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#Xp
If that fails, we might want to ask for the graphics card to be replaced.
###################
A while ago I wrote a script that adjusts the screen resolution on Win7 machines:
http://hg.mozilla.org/build/tools/file/default/scripts/support/mouse_and_screen_resolution.py
There is code to query screen resolutions.
We should find a way to prevent machines from starting up with a screen resolution that is too small. We could use runslave.py or start-buildbot.bat to prevent that (since we don't have pre-flight tasks yet).
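A minimal sketch of what such a pre-start check could look like, assuming the 1600x1200 target mentioned in comment 17. This is not the code from mouse_and_screen_resolution.py, and the names `resolution_ok`, `query_resolution`, and `preflight` are hypothetical. The query uses the Win32 `GetSystemMetrics` call through `ctypes`, so it only works on Windows; the comparison itself is platform-independent.

```python
import sys

# Assumed minimum, taken from comment 17 (healthy WinXP slaves run at 1600x1200).
REQUIRED_RESOLUTION = (1600, 1200)


def resolution_ok(current, required=REQUIRED_RESOLUTION):
    """Return True when both dimensions of `current` meet `required`."""
    width, height = current
    return width >= required[0] and height >= required[1]


def query_resolution():
    """Return the primary display size as (width, height). Windows only."""
    import ctypes  # imported lazily so the module still loads on other platforms

    user32 = ctypes.windll.user32
    # GetSystemMetrics(0) is SM_CXSCREEN, GetSystemMetrics(1) is SM_CYSCREEN.
    return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)


def preflight(current=None):
    """Return an exit code: 0 to let the slave start, 1 to refuse."""
    if current is None:
        current = query_resolution()
    if resolution_ok(current):
        return 0
    sys.stderr.write(
        "Refusing to start: resolution %dx%d is below %dx%d\n"
        % (current[0], current[1], REQUIRED_RESOLUTION[0], REQUIRED_RESOLUTION[1])
    )
    return 1
```

A wrapper such as runslave.py or start-buildbot.bat could call `sys.exit(preflight())` and skip launching the slave on a non-zero exit code. Note that this only covers resolution; it says nothing about whether graphics acceleration is available.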
Flags: needinfo?(armenzg)
Comment 22•11 years ago
On all of the machines there is a script, c:\monitor_config\fakemon.vbs, that will detect whether the second screen is missing, add it if necessary, and then adjust the resolution.
Comment 23•10 years ago
The screen resolution is now correct.
Rebooting into the pool.
Comment 24•10 years ago (Reporter)
The screen resolution *was* correct, for some apparently brief period. Screenshot in https://tbpl.mozilla.org/php/getParsedLog.php?id=39859470&tree=Mozilla-Inbound is it failing a test at 1024x768, https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=t-xp32-ix-085 is it failing most of the things it tried to do today.
Disabled in slavealloc.
Comment 25•10 years ago
I don't get it. This is how it looks now:
https://bug1013280.bugzilla.mozilla.org/attachment.cgi?id=8425449
Updated•10 years ago (Reporter)
QA Contact: armenzg → bugspam.Callek
Comment 26•10 years ago
Bug 1013280 is fixed, rebooted into production
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard