Closed Bug 876773 (t-w732-ix-118) Opened 11 years ago Closed 7 years ago

t-w732-ix-118 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)


Details

(Whiteboard: [buildduty][buildslaves][capacity])

Attachments

(3 files)

Pink Pixel of Death in https://tbpl.mozilla.org/php/getParsedLog.php?id=23485895&tree=Mozilla-Central, need to look into its memory.
Disabled in slavealloc.

I assume we need to run memtest.
Depends on: 875811
Memtest did not yield anything.
Maybe memtest does not work well for Windows memory checks?
Any suggestions on what to do?
Maybe we can add more fuzzing to these tests?
09:34:02     INFO -  REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/flexbox/flexbox-position-fixed-3.xhtml | image comparison (==), max difference: 128, number of differing pixels: 2
09:38:52     INFO -  REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/svg/text/display-none-1.svg | image comparison (==), max difference: 128, number of differing pixels: 2
If you load the log and click the "for push 7be40b778117" link, you'll be taken to https://tbpl.mozilla.org/?rev=7be40b778117. Then you can find the orange Win7 reftest R, click it, and in the lower middle of the screen there'll be a link to "open reftest analyzer." Once you open that, you can click on each test filename, then click the "circle differences" checkbox. That will put a red box around the 2 pixels which differ between the test and the reference displays. If you don't spot the box right off, it's clear down at the bottom a bit right of center, miles away from where anything at all was rendered, in the background.
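Failure lines like the two above follow a fixed shape, so they can be picked apart mechanically. A minimal sketch (the regex and function name are my own, not part of any Mozilla tooling):

```python
import re

# Matches reftest image-comparison failures of the form seen above:
#   REFTEST TEST-UNEXPECTED-FAIL | <test url> | image comparison (==),
#   max difference: N, number of differing pixels: M
FAIL_RE = re.compile(
    r"REFTEST TEST-UNEXPECTED-FAIL \| (?P<test>\S+) \| "
    r"image comparison \((?P<op>==|!=)\), "
    r"max difference: (?P<maxdiff>\d+), "
    r"number of differing pixels: (?P<pixels>\d+)"
)

def parse_reftest_failure(line):
    """Return (test, op, max_difference, differing_pixels), or None."""
    m = FAIL_RE.search(line)
    if m is None:
        return None
    return (m.group("test"), m.group("op"),
            int(m.group("maxdiff")), int(m.group("pixels")))

line = ("09:34:02     INFO -  REFTEST TEST-UNEXPECTED-FAIL | "
        "file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/"
        "flexbox/flexbox-position-fixed-3.xhtml | image comparison (==), "
        "max difference: 128, number of differing pixels: 2")
print(parse_reftest_failure(line))
```

For the flaky-memory signature in this bug, the interesting fields are the last two: a max difference of 128 (a single flipped high bit) across exactly 2 pixels.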

fuzzy-if() is for tests where there's something variable in the rendering, like antialiasing, or platform-specific scrollbars. In this case, it's fuzzy-if('the slave has a couple of flaky bits of memory',128,2), and those two tests just happen to have been the two that had the bad luck to try to use those bits during this run. To make fuzzing work for this problem, we would need to add the slavename to the sandbox, and make every single test which is run on this slave fuzzy for 2 pixels more than it is already fuzzy for, so that a test where we expect 20 pixels to be off would be 22 instead.
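For reference, a fuzzy-if annotation lives in the reftest manifest (reftest.list). The entry below is illustrative only; the condition and the reference file name are assumptions, not the actual manifest contents:

```
# reftest.list: allow up to 2 pixels to differ by up to 128 levels,
# but only when the condition holds (here, any Windows widget toolkit).
fuzzy-if(winWidget,128,2) == flexbox-position-fixed-3.xhtml flexbox-position-fixed-3-ref.xhtml
```

Note that the condition is evaluated in the manifest sandbox, which knows about platform and build configuration, not about which slave picked up the job, which is exactly why per-slave fuzzing would require adding the slavename to the sandbox first.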

Or, slightly more within the realm of possibility, we could add a pool of test slaves that we know are broken, and bypass the step where I see that it did a PPoD and retrigger, by just making reftest failures in that pool set RETRY.

Except that we can't, because there are two sorts of reftest, == where the test rendering has to be visually identical to the reference rendering, and != where the test rendering has to be different than the reference rendering. While a PPoD causes false orange in == tests, it causes false green in != tests, so that pool of known broken slaves would need to set RETRY for both orange and for green.
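The asymmetry between the two comparison types can be made concrete with a toy model (an illustrative sketch, not the actual reftest harness):

```python
def reftest_passes(op, differing_pixels):
    """Toy model of reftest comparison semantics.

    '==' tests pass when the test and reference renderings are
    identical; '!=' tests pass when they differ anywhere at all.
    """
    if op == "==":
        return differing_pixels == 0
    elif op == "!=":
        return differing_pixels > 0
    raise ValueError("unknown comparison type: %r" % op)

# A healthy slave renders the expected result:
assert reftest_passes("==", 0)          # green, correctly
assert not reftest_passes("!=", 0)      # orange, correctly

# A slave with a couple of flaky bits flips two pixels in an
# otherwise identical rendering:
assert not reftest_passes("==", 2)      # false orange
assert reftest_passes("!=", 2)          # false green
```

So any blanket "retry on failure" policy for a broken pool catches only the == half of the problem; the != half fails silently green.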

We could have a pool of known broken slaves that we only allow to run tests which are not reftests, except that where we first started realizing we had bad memory was not in reftests: it was in crashes, when GC happened to hit a pointer stored in bad memory and went chasing off into memory where it had no business going.

So, if a slave in this state says a reftest run was orange, it wasn't, or it was; if it says a reftest run was green, it was, or it wasn't; if it says a crash happened, it did, or it didn't; for that matter, since the bit flip could be where we are storing any variable that any test compares against, if it says that any test run at all was green or orange or red, it was or it wasn't. It's not just that the 11 runs out of the 222 that this slave has done which were orange are suspect - every single one of the green ones is suspect too, they could just as easily have been on a rev that should have failed, but didn't.
Depends on: 889420
Reimaged and back in production
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Something related to graphics, though not my first thought (resolution), is busted on it. It can pass regular reftests, but fails unaccelerated reftests, and fails everything that depends on WebGL.

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Product: mozilla.org → Release Engineering
Depends on: 914184
Requested DCOps to look into the graphics setup.
Rebooted hesitantly into production.
After DCOps intervention I can see a good screen resolution but I still see two monitors.
I will keep an eye on this.
Assignee: nobody → armenzg
We're debugging things with DCOps again.
I've put it back into production after DCOps disabled the second monitor.
Disabled in slavealloc after verifying that the reason it failed two of the three runs it did was that it still cannot run webgl tests.
Depends on: 873566
Let's see how bug 873566 plays out.
Assignee: armenzg → nobody
Rebooted into production.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
And exactly like it did in May, it fails reftests with a couple of pixels a bit off.

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 916837
1) Let's re-image
2) Put it on staging to run reftests
2.a) if good put in production
2.b) if not, file bug for DCOps to look into it (maybe memory corruption? graphic card replacement?)
Armen, the last comment is from you. Can you elaborate on the current state and drive this through?
Flags: needinfo?(armenzg)
Assignee: nobody → armenzg
Flags: needinfo?(armenzg)
It got re-imaged on bug 916837.

Time to put it on staging.
It fails on staging:
http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50

I see a small resolution and two displays.
Assignee: armenzg → nobody
It looks good now:
http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50

Fixing the graphics card setup in bug 914184 did the trick.

Putting into production.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Hmm, I thought we had access to dev-master01 through the VPN now.

At any rate, it's still completely busted, in particular failing every reftest run it attempts, disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Regarding the VPN, I believe you might have VPN access per host.

I've re-enabled the master and re-triggered jobs for the slave:
http://dev-master01.build.scl1.mozilla.com:8042/buildslaves/t-w732-ix-118?numbuilds=50

Let me check tomorrow.
Assignee: nobody → armenzg
This is rather annoying.
All opt and talos jobs came out green.
I'm going to try a newer m-i build for debug test jobs as well.
It has only failed mochitest-other with this:
16:39:26     INFO -  5614 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/a11y/accessible/tests/mochitest/jsat/test_braille.html | uncaught exception - NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName] at resource://gre/modules/accessibility/OutputGenerator.jsm:217
16:39:26     INFO -  JavaScript error: resource://gre/modules/accessibility/OutputGenerator.jsm, line 217: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName]

13:13:53     INFO -  5613 ERROR TEST-UNEXPECTED-FAIL | chrome://mochitests/content/a11y/accessible/tests/mochitest/jsat/test_braille.html | uncaught exception - NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName] at resource://gre/modules/accessibility/OutputGenerator.jsm:217
13:13:53     INFO -  JavaScript error: resource://gre/modules/accessibility/OutputGenerator.jsm, line 217: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIStringBundle.GetStringFromName]


I've re-triggered the opt & debug jobs.
If you don't like seeing that test_braille failure, you'll need to pick a new inbound build, that was permaorange for a while yesterday.
(In reply to Phil Ringnalda (:philor) from comment #26)
> If you don't like seeing that test_braille failure, you'll need to pick a
> new inbound build, that was permaorange for a while yesterday.

If that is the case, did we remove the slave too early this last time?
I tried loading the jobs that had failed in production; however, the logs were long gone.
Um, no? Given https://secure.pub.build.mozilla.org/buildapi/recent/t-w732-ix-118 I would have disabled it the last time for failing reftest-noaccel on birch, reftest-noaccel on inbound, reftest on fx-team, reftest-noaccel on aurora. birch and inbound are related, but in the other direction: birch runs inbound csets from the day before. So for that theory to fly, we would have had to land reftest bustage on inbound, left it in permaorange for a full day, merged it to fx-team, and then uplifted it to aurora.
* In comment 2 we checked for memory failures and did not notice anything
* It got re-imaged in comment 19
* The graphics setup got fixed in comment 21
* philor took it out of the pool in comment 22
* Running jobs on dev-master01 does not show how this machine could be broken

I'm thinking of requesting to replace the memory and perhaps the graphic card.

Any thoughts?
Depends on: 932818
I've seen this:
12:09:29     INFO -  Error 87 opening target process
12:09:29     INFO -  Can't trigger Breakpad, just killing process
12:09:29     INFO -  Failed to kill process 3428: [Error 87] The parameter is incorrect
12:09:29  WARNING -  TEST-UNEXPECTED-FAIL | /tests/layout/base/tests/test_event_target_iframe_apps_oop.html | application terminated with exit code 259

I believe this output comes from here:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#664

Have we started seeing this recently? I deployed a newer mozprocess this week.

I assume we crash due to memory corruption and we try to inject breakpad which should work, no?

I uploaded this log:
http://people.mozilla.org/~armenzg/incoming/bug876773.log.txt
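The logic around that output in runtests.py is essentially a crash-dump-then-kill escalation ladder. A simplified sketch, with hypothetical callables (trigger_breakpad and kill_process are placeholders standing in for the real harness helpers, not their actual names):

```python
def terminate_hung_test(pid, trigger_breakpad, kill_process, log):
    """Sketch of a crash-dump-then-kill escalation ladder.

    trigger_breakpad and kill_process are hypothetical callables
    that raise OSError on failure; they stand in for the real
    harness helpers, which this does not claim to reproduce.
    """
    try:
        # Preferred path: poke Breakpad in the hung process so the
        # failure produces a usable minidump instead of a bare kill.
        trigger_breakpad(pid)
        return "breakpad"
    except OSError as e:
        log("Can't trigger Breakpad (%s), just killing process" % e)
    try:
        kill_process(pid)
        return "killed"
    except OSError as e:
        log("Failed to kill process %d: %s" % (pid, e))
        return "failed"
```

The log above shows the worst case of this ladder: opening the target process failed with Windows error 87 (ERROR_INVALID_PARAMETER), so Breakpad could not be triggered, and then the fallback kill failed with the same error, leaving only the "terminated with exit code 259" failure line.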
In staging now after a memory swap with t-w732-ix-082.
Depends on: 937778
I see two displays.
This probably happened during the memory swap.
The display setup got fixed on bug 937778 and staging looked good.
I hope it stays like that!
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Phil Ringnalda (:philor) from comment #34)
> Another PPoD in
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31834385&tree=Mozilla-Inbound.

So we swapped the memory.
Should we try swapping the graphic card?
Oops, we might want to have reopened that so it wouldn't just sit idle without anything being done about it for 99 days.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks for the catch.

For the following reasons...

This host has been through:
- memtest checks
- re-imaging
- memory replacement -> bug 932818
- single-display graphics setup -> bug 937778 and bug 914184 (just verified over VNC too)

This seems to be outside the similar issues with the Mac PPoD machines:
- the PPoD resolution there was to replace memory, and it *only* worked for the Mac machines

I think it is worth investigating the gfx card for memory issues, either by running appropriate diagnostics (if they exist) or by swapping/replacing the gfx card. I will leave that up to IT.
Depends on: 999222
Assignee: armenzg → nobody
Reenabled and rebooted, though I don't have much faith.
Status: REOPENED → RESOLVED
Closed: 10 years ago
QA Contact: armenzg → bugspam.Callek
Resolution: --- → FIXED
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1224640)
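The automated recovery comments above (and the ones that follow) all trace the same escalation ladder: SSH, then IPMI, then a bug for DCOps. A minimal sketch with placeholder callables (ssh_reboot, ipmi_reboot, and file_it_bug are hypothetical; the real tooling works differently):

```python
def recover_slave(hostname, ssh_reboot, ipmi_reboot, file_it_bug, log):
    """Try progressively more forceful recovery methods for a slave.

    Each callable is a placeholder returning True on success; this
    only mirrors the order seen in the automated comments: SSH first,
    then IPMI, then filing an IT bug for a manual reboot.
    """
    for name, action in (("SSH reboot", ssh_reboot),
                         ("IPMI reboot", ipmi_reboot)):
        log("Attempting %s..." % name)
        if action(hostname):
            log("%s succeeded" % name)
            return name
        log("Failed.")
    bug = file_it_bug(hostname)
    log("Filed IT bug for reboot (bug %s)" % bug)
    return "it_bug"
```

When both remote methods fail, as they do here, the ladder bottoms out in a human: someone in the datacenter has to touch the machine.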
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1225008)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1255015)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Got a new drive and a re-image. Taking jobs at the moment.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1332720
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed.
Filed IT bug for reboot (bug 1375617)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Back online
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard