Closed
Bug 876773
(t-w732-ix-118)
Opened 11 years ago
Closed 7 years ago
t-w732-ix-118 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
()
Details
(Whiteboard: [buildduty][buildslaves][capacity])
Attachments
(3 files)
Pink Pixel of Death in https://tbpl.mozilla.org/php/getParsedLog.php?id=23485895&tree=Mozilla-Central, need to look into its memory.
Comment 1•11 years ago
Disabled in slavealloc. I assume we need to run memtest.
Comment 2•11 years ago
Memtest did not yield anything. Maybe memtest does not work well for Windows memory checks? Any suggestions on what to do?
Comment 3•11 years ago
Maybe we can add more fuzzing to these tests?
09:34:02 INFO - REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/flexbox/flexbox-position-fixed-3.xhtml | image comparison (==), max difference: 128, number of differing pixels: 2
09:38:52 INFO - REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/svg/text/display-none-1.svg | image comparison (==), max difference: 128, number of differing pixels: 2
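For anyone scanning logs for this signature, here is a minimal sketch (not part of any Mozilla tool) that pulls the max difference and differing-pixel count out of reftest failure lines like the two quoted above. The regular expression is inferred only from those two lines, and `parse_failures`, `looks_like_stray_pixels`, and the `pixel_limit` default are hypothetical names and values chosen for illustration.

```python
import re

# Pattern inferred from the two failure lines quoted in this comment;
# other reftest log formats may not match it.
FAILURE_RE = re.compile(
    r"REFTEST TEST-UNEXPECTED-FAIL \| (?P<test>\S+) \| "
    r"image comparison \((?P<op>==|!=)\), "
    r"max difference: (?P<max_diff>\d+), "
    r"number of differing pixels: (?P<pixels>\d+)"
)


def parse_failures(log_text):
    """Return one dict per image-comparison failure found in log_text."""
    failures = []
    for match in FAILURE_RE.finditer(log_text):
        failures.append({
            "test": match.group("test"),
            "op": match.group("op"),
            "max_diff": int(match.group("max_diff")),
            "pixels": int(match.group("pixels")),
        })
    return failures


def looks_like_stray_pixels(failure, pixel_limit=2):
    """Hypothetical heuristic: only a handful of pixels differ."""
    return failure["pixels"] <= pixel_limit
```

A heuristic like this can only flag runs worth a second look; it cannot tell a flaky machine from a real rendering regression that happens to touch two pixels.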
Reporter | ||
Comment 4•11 years ago
If you load the log and click the "for push 7be40b778117" link, you'll be taken to https://tbpl.mozilla.org/?rev=7be40b778117. Then you can find the orange Win7 reftest R, click it, and in the lower middle of the screen there'll be a link to "open reftest analyzer." Once you open that, you can click on each test filename, then click the "circle differences" checkbox. That will put a red box around the 2 pixels which differ between the test and the reference displays. If you don't spot the box right off, it's clear down at the bottom a bit right of center, miles away from where anything at all was rendered, in the background.
fuzzy-if() is for tests where there's something variable in the rendering, like antialiasing, or platform-specific scrollbars. In this case, it's fuzzy-if('the slave has a couple of flaky bits of memory',128,2), and those two tests just happen to have been the two that had the bad luck to try to use those bits during this run. To make fuzzing work for this problem, we would need to add the slavename to the sandbox, and make every single test which is run on this slave fuzzy for 2 pixels more than it is already fuzzy for, so that a test where we expect 20 pixels to be off would be 22 instead.
Or, slightly more within the realm of possibility, we could add a pool of test slaves that we know are broken, and bypass the step where I see that it did a PPoD and retrigger, by just making reftest failures in that pool set RETRY. Except that we can't, because there are two sorts of reftest, == where the test rendering has to be visually identical to the reference rendering, and != where the test rendering has to be different than the reference rendering. While a PPoD causes false orange in == tests, it causes false green in != tests, so that pool of known broken slaves would need to set RETRY for both orange and for green.
We could have a pool of known broken slaves that we only allow to run tests which are not reftests, except that where we really first started realizing that we had bad memory was not in reftests, it was in crashes when GC happened to hit a pointer stored in bad memory, and went chasing off into memory where it had no business going and crashed. So, if a slave in this state says a reftest run was orange, it wasn't, or it was; if it says a reftest run was green, it was, or it wasn't; if it says a crash happened, it did, or it didn't; for that matter, since the bit flip could be where we are storing any variable that any test compares against, if it says that any test run at all was green or orange or red, it was or it wasn't. It's not just that the 11 runs out of the 222 that this slave has done which were orange are suspect - every single one of the green ones is suspect too, they could just as easily have been on a rev that should have failed, but didn't.
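The ==/!= asymmetry described above can be illustrated with a small sketch. This is a toy model, not the reftest harness: images are flat lists of 8-bit values, `diff_stats` and `reftest_passes` are hypothetical helpers, and the treatment of the fuzzy tolerance (pass when both the max difference and the pixel count are within bounds) is an assumption made for the example.

```python
def diff_stats(test, ref):
    """Return (max per-pixel difference, number of differing pixels)."""
    max_diff = 0
    differing = 0
    for a, b in zip(test, ref):
        delta = abs(a - b)
        if delta:
            differing += 1
            max_diff = max(max_diff, delta)
    return max_diff, differing


def reftest_passes(kind, test, ref, fuzzy=(0, 0)):
    """Toy model: '==' passes when the images match within the fuzzy
    tolerance; '!=' passes when they do not."""
    max_diff, differing = diff_stats(test, ref)
    allowed_diff, allowed_pixels = fuzzy
    same = max_diff <= allowed_diff and differing <= allowed_pixels
    return same if kind == "==" else not same


# Two renderings that should be identical, except that one bit flips
# in two pixels of the test image (255 ^ 0x80 == 127).
reference = [255] * 100
corrupted = list(reference)
corrupted[97] ^= 0x80
corrupted[98] ^= 0x80
```

With these inputs `diff_stats(corrupted, reference)` gives `(128, 2)`, matching the numbers in the log above. The `==` check fails (false orange) and the `!=` check passes even though the underlying renderings were the same (false green), so a retry rule keyed only on failures would miss half the damage.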
Comment 5•11 years ago
Comment 6•11 years ago
Comment 7•11 years ago
Reimaged and back in production
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 8•11 years ago
Something related to graphics is busted on it, though not the resolution, which was my first thought. It can pass regular reftests, but fails unaccelerated reftests, and fails everything that depends on WebGL. Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 9•11 years ago
Requested DCOps to look into the graphics setup.
Comment 10•11 years ago
Rebooted hesitantly into production. After DCOps intervention I can see a good screen resolution but I still see two monitors. I will keep an eye on this.
Assignee: nobody → armenzg
Comment 11•11 years ago
We're debugging things with DCOps again.
Comment 12•11 years ago
I've put it back into production after DCOps disabled the second monitor.
Reporter | ||
Comment 13•11 years ago
Disabled in slavealloc after verifying that the reason it failed two of the three runs it did was that it still cannot run webgl tests.
Comment 15•11 years ago
Rebooted into production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 16•11 years ago
And exactly like it did in May, it fails reftests with a couple of pixels a bit off. Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 17•11 years ago
1) Let's re-image
2) Put it on staging to run reftests
2.a) if good put in production
2.b) if not, file bug for DCOps to look into it (maybe memory corruption? graphic card replacement?)
Comment 18•11 years ago
Armen, the last comment is from you. Can you elaborate on the current state and drive it through?
Flags: needinfo?(armenzg)
Updated•11 years ago
Assignee: nobody → armenzg
Flags: needinfo?(armenzg)
Comment 19•11 years ago
It got re-imaged on bug 916837. Time to put it on staging.
Comment 20•11 years ago
It fails on staging: http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50 I see small resolution and two displays.
Updated•11 years ago
Assignee: armenzg → nobody
Comment 21•11 years ago
It looks good now: http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50 Fixing the graphics card setup in bug 914184 did the trick. Putting into production.
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 22•11 years ago
Hmm, I thought we had access to dev-master01 through the VPN now. At any rate, it's still completely busted, in particular failing every reftest run it attempts. Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 23•11 years ago
WRT the VPN, I believe you might have VPN access per host. I've re-enabled the master and re-triggered jobs for the slave: http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50 Let me check tomorrow.
Assignee: nobody → armenzg
Comment 24•11 years ago
This is rather annoying. All opt and talos jobs came out green. I'm going to try a newer m-i build for debug test jobs as well.
Comment 25•11 years ago
It has only failed mochitest-other with this:
16:39:26 INFO - 5614 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/a11y/accessible/tests/mochitest/jsat/test_braille.html | uncaught exception - NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName] at resource://gre/modules/accessibility/OutputGenerator.jsm:217
16:39:26 INFO - JavaScript error: resource://gre/modules/accessibility/OutputGenerator.jsm, line 217: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName]
13:13:53 INFO - 5613 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/a11y/accessible/tests/mochitest/jsat/test_braille.html | uncaught exception - NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName] at resource://gre/modules/accessibility/OutputGenerator.jsm:217
13:13:53 INFO - JavaScript error: resource://gre/modules/accessibility/OutputGenerator.jsm, line 217: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName]
I've re-triggered the opt & debug jobs.
Reporter | ||
Comment 26•11 years ago
If you don't like seeing that test_braille failure, you'll need to pick a new inbound build, that was permaorange for a while yesterday.
Comment 27•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #26)
> If you don't like seeing that test_braille failure, you'll need to pick a
> new inbound build, that was permaorange for a while yesterday.
If that is the case, did we remove the slave too early this last time? I tried loading the jobs that had failed in production, however, the logs were long gone.
Reporter | ||
Comment 28•11 years ago
Um, no? Given https://secure.pub.build.mozilla.org/buildapi/recent/t-w732-ix-118 I would have disabled it the last time for failing reftest-noaccel on birch, reftest-noaccel on inbound, reftest on fx-team, reftest-noaccel on aurora. birch and inbound are related, but in the other direction: birch runs inbound csets from the day before. So for that theory to fly, we would have had to land reftest bustage on inbound, left it in permaorange for a full day, merged it to fx-team, and then uplifted it to aurora.
Comment 29•11 years ago
* In comment 2 we checked for memory failures and did not notice anything
* It got re-imaged in comment 19
* The graphics setup got fixed in comment 21
* philor put it out of the pool in comment 22
* Running jobs on dev-master01 does not show how this machine could be broken
I'm thinking of requesting to replace the memory and perhaps the graphics card. Any thoughts?
Comment 30•11 years ago
I've seen this:
12:09:29 INFO - Error 87 opening target process
12:09:29 INFO - Can't trigger Breakpad, just killing process
12:09:29 INFO - Failed to kill process 3428: [Error 87] The parameter is incorrect
12:09:29 WARNING - TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_event_target_iframe_apps_oop.html | application terminated with exit code 259
I believe this output comes from here: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#664
Have we started seeing this recently? I deployed a newer mozprocess this week. I assume we crash due to memory corruption and we try to inject breakpad which should work, no?
I uploaded this log: http://people.mozilla.org/~armenzg/incoming/bug876773.log.txt
Comment 31•11 years ago
In staging now after a memory swap with t-w732-ix-082.
Comment 32•11 years ago
I see two displays. Probably happened after the swapping of the memory.
Comment 33•11 years ago
The display setup got fixed on bug 937778 and staging looked good. I hope it stays like that!
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 34•11 years ago
Another PPoD in https://tbpl.mozilla.org/php/getParsedLog.php?id=31834385&tree=Mozilla-Inbound.
Comment 35•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #34)
> Another PPoD in
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31834385&tree=Mozilla-Inbound.
So we swapped the memory. Should we try swapping the graphics card?
Reporter | ||
Comment 36•11 years ago
Another PPoD in https://tbpl.mozilla.org/php/getParsedLog.php?id=32134594&tree=Fx-Team
Comment 37•10 years ago
YPoD: https://tbpl.mozilla.org/php/getParsedLog.php?id=32712713&tree=Mozilla-Inbound
Comment 38•10 years ago
And another. Disabled. https://tbpl.mozilla.org/php/getParsedLog.php?id=32716698&tree=Mozilla-Inbound
Reporter | ||
Comment 39•10 years ago
Oops, we might want to have reopened that so it wouldn't just sit idle without anything being done about it for 99 days.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 40•10 years ago
Thanks for the catch.
For the following reasons...
this host has been through:
- memtest checks
- re-imaging
- memory card replacement -> bug 932818
- single display graphics setup -> bug 937778 and Bug 914184 (just verified over vnc too)
this seems to be outside of the similar issues with mac ppod machines:
- bug ppod resolution was to replace memory and *only* worked for mac machines
I think it is worth investigating gfx card for memory issues by either running appropriate diagnostics (if they exist) or swapping/replacing gfx card. I will leave that up to IT.
Updated•10 years ago
Assignee: armenzg → nobody
Reporter | ||
Comment 41•10 years ago
Reenabled and rebooted, though I don't have much faith.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
QA Contact: armenzg → bugspam.Callek
Resolution: --- → FIXED
Comment 42•9 years ago
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1224640)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
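The escalation order in the automated message above (SSH first, then IPMI, then an IT bug) can be sketched as a simple fallback chain. This is only an illustration of that ordering, not the tool that posted the comment: `reboot_with_fallback`, the `methods` callables, `file_bug`, and the "Succeeded." wording are all hypothetical.

```python
def reboot_with_fallback(host, methods, file_bug):
    """Try each reboot method in order; escalate if every one fails.

    methods:  list of (name, callable) pairs; each callable takes the
              host name and returns True on success (hypothetical).
    file_bug: callable taking the host name and returning a bug label;
              only called when all methods have failed (hypothetical).
    Returns the list of status messages produced along the way.
    """
    messages = []
    for name, reboot in methods:
        try:
            succeeded = reboot(host)
        except Exception:
            # Treat any error from a reboot method as a failed attempt.
            succeeded = False
        if succeeded:
            messages.append(f"Attempting {name} reboot...Succeeded.")
            return messages
        messages.append(f"Attempting {name} reboot...Failed.")
    bug = file_bug(host)
    messages.append(f"Filed IT bug for reboot ({bug})")
    return messages
```

Passing the reboot methods in as callables keeps the ordering logic testable without touching a real machine; a caller would supply its own SSH and IPMI functions.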
Reporter | ||
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Comment 43•9 years ago
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1225008)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Comment 44•8 years ago
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1255015)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 45•8 years ago
Got a new drive and a re-image. Taking jobs at the moment.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Reporter | ||
Updated•7 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Comment 46•7 years ago
Attempting SSH reboot...Failed. Filed IT bug for reboot (bug 1375617)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 47•7 years ago
Back online
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard