Bug 1121199 (closed): Opened 9 years ago, Closed 9 years ago

Green up 10.10 tests currently failing on try

Categories

(Testing :: General, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Attached file 10.10results.txt —
https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=f5947d58ab02&filter-platform=10-10

Current status is as follows: failures in 

mochitest-1
mochitest-2
mochitest-5
mochitest-browser-chrome-1
mochitest-devtools-chrome
mochitest-gl
mochitest-other
reftest

I'll attach links to the relevant logs.

In future iterations I'm going to try parsing the logs as structured results, as described in http://ahal.ca/blog/2014/consume-structured-test-results/ (a rough sketch follows below).

I've omitted results that fail on cedar but are not enabled on other branches.
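A rough sketch of the structured-log parsing idea, assuming the log is a file of JSON objects, one per line, with mozlog-style "action", "test", "status", and "expected" fields; the exact field names may differ from the real format:

# Rough sketch: count unexpected results from a structured (JSON-lines) log.
# Assumes mozlog-style records with "action", "test", "status", "expected";
# adjust the field names if the real format differs.
import json
import sys
from collections import Counter

def summarize(log_path):
    unexpected = Counter()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except ValueError:
                continue  # skip any non-JSON lines mixed into the log
            if record.get("action") not in ("test_status", "test_end"):
                continue
            status, expected = record.get("status"), record.get("expected")
            # structured logs typically only carry "expected" when the result
            # differs from it
            if expected is not None and status != expected:
                unexpected[(record.get("test"), status, expected)] += 1
    return unexpected

if __name__ == "__main__":
    for (test, status, expected), count in sorted(summarize(sys.argv[1]).items()):
        print("%s: got %s, expected %s (x%d)" % (test, status, expected, count))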
Summary: 10.10 tests currently failing on cedar → Green up 10.10 tests currently failing on cedar
It looks like m1 is running the gl tests; compare the failures of m1 and gl, they are similar. It could be that we have blacklisted/whitelisted a driver/graphics card on the 10.10 setup, which is causing things to fail or not be set up correctly.

First thing to fix is to ensure that the gl tests are not run in m1.
Re comment #1: yes, I'm aware of the gl tests issue. I believe https://bugzilla.mozilla.org/show_bug.cgi?id=1051886#c61 addresses it; I'll ping jgilbert to see if this can be updated.
Flags: needinfo?(gijskruitbosch+bugs)
Depends on: 947690
Depends on: 1121835
Depends on: 1122478
Flags: needinfo?(gijskruitbosch+bugs)
Not bothering with dependencies, since they are a bunch of perpetually-open bugs nobody will ever actually fix, but adjusted the annotations for bug 744125, bug 639705, bug 756885, bug 786938. Removing the bug 1121487 dependency, since bug 744125 says even if pointerlock doesn't crash, it's never going to actually run.
No longer depends on: 1121487
Depends on: 1122875
Depends on: 1122882
Depends on: 1122959
Depends on: 1122992
Depends on: 1123085
Depends on: 792304
Depends on: 1123195
jmaher: regarding your comment on mochitest-gl that it might be a blacklisted/whitelisted driver/graphics card on the 10.10 setup, how can I tell if this is the issue? BTW, the tests are running on the same hardware as the 10.8 tests.
Flags: needinfo?(jmaher)
Luckily this is the same hardware as 10.8; maybe the driver version in the software is different. I recall a list somewhere; let me pull in a graphics guy.

:bjacob, can you help me determine if there is a whitelist/blacklist of drivers for the gl tests and maybe how to determine on a osx 10.10 machine what the value is?
Flags: needinfo?(jmaher) → needinfo?(jacob.benoit.1)
No longer sucker^Wworking at mozilla!
Flags: needinfo?(jacob.benoit.1)
You want :jgilbert or :djg.
:jgilbert, can you help us determine if there is a software driver whitelist/blacklist for the new osx 10.10 machines and the gl tests?
Flags: needinfo?(jgilbert)
(In reply to Joel Maher (:jmaher) from comment #8)
> :jgilbert, can you help us determine if there is a software driver
> whitelist/blacklist for the new osx 10.10 machines and the gl tests?

Our MO for this is to just mark the failing tests as failing, and file bugs so we can fix them later. I don't think blacklisting is an issue here.
Flags: needinfo?(jgilbert)
No longer depends on: 1121480
No longer depends on: 1121505
Depends on: 1124549
Depends on: 1125003
Removing some dependencies that are fixed by disabling in leave-open bugs.
No longer depends on: 792304, 947690, 1122872
Depends on: 1125479
Current status: https://treeherder.mozilla.org/#/jobs?repo=try&revision=985356af18a4

* talos isn't running on Try, though the only thing I saw from it on Cedar was the suspicious way that 0001 and only 0001 would intermittently be unable to resolve graphs.mozilla.org. 0002 and 0003 were Try build slaves before they became 10.10 test slaves, but I think 0001 was a 10.9 slave, and before that... fell off the back of a truck? found under someone's desk? so maybe we don't have a general problem with DNS on 10.10, maybe.

* the unexpected pass on reftests/css-gradients/aja-linear-1b.html on debug only, a test which we currently expect only to pass on d2d, I can neither explain nor annotate, someone else can have that.

* debug crashtest is the biggest problem - I disabled a test in bug 1123195 for consistently hanging, and now instead we hang in one of several later tests (sometimes, instead of hanging in 724978.xhtml, we'll time out taking the snapshot in it, and instead hang in 730559.html), so I doubt we can just disable our way to victory there, and we'll instead have to convince someone's manager to convince them to take several days with a loaner to determine why we are so slow and hang-prone running debug tests (it's only crashtests where we're consistently hanging, but in general debug tests are taking 1.5 times as long as the same suite on debug 10.8, though the slaves are supposed in theory to be comparable). I don't have any idea which manager or which engineer, so someone else can have that.

* when crashtest hangs, instead of killing the process and getting a stack, we get "NameError: global name 'proc' is not defined", and I don't know whether that's releng or ateam or what thing is doing the wrong thing, so someone else can have that, and they should ideally induce a crash and/or hang in every class of suite to be sure we're actually getting usable stacks everywhere.
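For what it's worth, that NameError has the classic shape of a scoping bug in the harness's hang-handling path: the code that should kill the process and grab a stack refers to a process handle that was never assigned in its scope. A hypothetical minimal illustration, not the actual harness code:

# Hypothetical illustration of the failure mode, not the actual harness code.
import subprocess

def run_test(cmd, timeout=300):
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        on_hang()

def on_hang():
    # BUG: `proc` only exists inside run_test(), so this raises
    # "NameError: ... name 'proc' is not defined" instead of killing the
    # hung process and collecting a stack.
    proc.kill()

# A fix would pass the handle explicitly, e.g. on_hang(proc), so the hang
# path can actually kill the process and dump its stack.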
Given https://treeherder.mozilla.org/logviewer.html#?job_id=538073&repo=mozilla-aurora on 10.8, it's also possible that we're lucky we don't hang in crashtests very often because we're completely broken for doing anything useful when we do, on either all-Mac, or on every platform, dunno.
Filed bug 1125679 for the crashtest hang stack thing, which has actually been broken on both Linux and Mac since last October, and has nothing to do with this.
Thanks philor for all this investigation, the tests look much greener now.

The 10.10 machines are reimaged 10.8 machines, the same rev of hardware, to my understanding. So no dodgy minis off the back of a truck :-)

Regarding comment #11 and the graph server.  I checked the graphs database and talos data was entered into it for 10.10 machines. I added a patch in bug 1125853 to add it to graphs/data.sql

It looks like disabling 724978.xhtml and 730559.html did make the crashtests green, although only intermittently.
http://hg.mozilla.org/try/rev/4993c517e85e90298d22d5f2fa2265e53d05e5ad
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4993c517e85e&filter-searchStr=Rev5%20MacOSX%20Yosemite%2010.10%20try%20debug%20test%20crashtest

I'll open a bug for the issue where debug tests take 1.5-2x longer on 10.10 than on 10.8.
They aren't freshly stolen 10.8 machines, because you would have heard me screaming, literally rather than seeing words typed in all-caps on IRC. We're desperately, horribly, terribly short of 10.8 slaves, a single loan is a huge event and I constantly badger the borrower until they give it back.

t-mavericks-r5-002 and 003 were Try build slaves, bug 946303, but actually I'm not sure which of the yosemite slaves those are, since in the 10.9 to 10.10 transition in bug 1094279 t-mavericks-r5-003 was already in the process of becoming 10.10 and it's not clear to me who became what. Assuming nothing else happened to it, bug 895628 leads me to believe that t-mavericks-r5-001 actually was a 10.8 slave, so it's good for it that I didn't realize that or I would have been screaming to get it back.

So one of the three is a 10.8 slave, and I haven't seen one clearly doing better than the other two, which probably means I can't blame the debug slowness on them not being exactly the same spec as the 10.8 slaves.
Depends on: 1125998
Summary: Green up 10.10 tests currently failing on cedar → Green up 10.10 tests currently failing on try
Blocks: 1126493
Ah, bug 1059578, yosemite-0001 was mavericks-003 and thus is a former build slave. The times from when I retriggered three of a suite at once in https://treeherder.mozilla.org/#/jobs?repo=try&revision=74fe47408f2e seem pretty typical for the times the three generally get, so I'd say that yosemite-0002 was mavericks-002 and is the other former build slave, and yosemite-0003 was mavericks-001, the former 10.8 test slave, and that the "r5" for the 10.8 test slaves is a slower CPU, less memory, slower memory, or a slower disk than the "r5" for the build slaves.

Or mavericks just happened to get one broken 10.8 slave, it's a pretty small sample size.
The machines have the same config in terms of memory amount, memory speed, and CPU if you run

system_profiler SPHardwareDataType

with the exception that yosemite-0001 has a different Boot ROM version. If you run diskutil list to see the devices, and diskutil info /dev/disk1 etc., the drives are different models.
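For the record, a small script one could run on each slave to capture just those fields so the output can be diffed across machines; the device name and exact field labels are assumptions and may vary by OS X version:

# Collect hardware/disk fields of interest on one machine; run on each slave
# and diff the output. Field labels and /dev/disk0 are assumptions and may
# vary by OS X version.
import subprocess

def run(cmd):
    return subprocess.check_output(cmd, universal_newlines=True)

def hardware_summary():
    wanted = ("Model Identifier", "Processor", "Memory", "Boot ROM Version")
    out = run(["system_profiler", "SPHardwareDataType"])
    return [l.strip() for l in out.splitlines() if any(k in l for k in wanted)]

def disk_summary(device="/dev/disk0"):
    wanted = ("Device / Media Name", "Disk Size", "Solid State")
    out = run(["diskutil", "info", device])
    return [l.strip() for l in out.splitlines() if any(k in l for k in wanted)]

if __name__ == "__main__":
    print("\n".join(hardware_summary()))
    print("\n".join(disk_summary()))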
Interesting, are they from back when you could choose between 5400 and 7200rpm disks? Being able to drop 10 minutes off a 45 minute job with a faster disk could be useful, though probably not useful enough to buy 125 new drives.
Depends on: 1128517
I dunno, 10 minutes * a zillion jobs adds up pretty fast.
Or just go to SSD. Seriously, for $70 each we could get 7200 rpm drives; that is <$10K total. Obviously there is a cost to set up the drives and install them. Imagine 40,000 jobs/month at 10 minutes saved each: that would yield 6,600+ hours of time saved, which would really help our overall load (quick arithmetic below).
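The back-of-the-envelope arithmetic, using the estimates above (these are the figures quoted in the comment, not measurements):

# Back-of-the-envelope savings estimate; all inputs are the rough figures
# quoted above, not measurements.
jobs_per_month = 40000
minutes_saved_per_job = 10
drives = 125
cost_per_drive = 70  # dollars

hours_saved = jobs_per_month * minutes_saved_per_job / 60.0  # ~6,666 hours/month
total_cost = drives * cost_per_drive                         # $8,750, i.e. <$10K

print("hours saved per month: %.0f" % hours_saved)
print("hardware cost: $%d" % total_cost)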
We looked at using SSD drives for builders last year but it didn't show a performance improvement for builds. Running tests on SSDs wasn't investigated. According to catlee, RAM made a much bigger difference.

See
https://bugzilla.mozilla.org/show_bug.cgi?id=992378

Also, arr said that timeouts increased while running these builds.  

In any case, we'd have to measure the performance improvement before we go ahead and buy more drives.
Blocks: 1128567
Depends on: 1128955
Depends on: 1129045
Depends on: 1129300
I was going to call https://treeherder.mozilla.org/#/jobs?repo=try&revision=46fb2f20ed25 my report on the current status, but https://treeherder.mozilla.org/#/jobs?repo=try&revision=77db20550e97 is actually better, with one debug mochitest timing out because of teh slowz.
So we just have debug m-3 to worry about and the general slowness of debug tests, right? The rest of the issues are taken care of by adjusting manifests.
(In reply to Kim Moir [:kmoir] from comment #17)

This is likely just because one of these machines lost a drive and had to have it replaced. Both models are 500GB 7200 RPM 16MB cache.

talos-mtnlion-r5-001: H2T5001672S: http://www.memory4less.com/m4l_itemdetail.aspx?itemid=1465009711
t-yosemite-r5-0001: WD5000BPKT: http://www.memory4less.com/m4l_itemdetail.aspx?itemid=1464115165

The Hitachi has a slightly higher seek time (12ms vs 8.5ms), so the disk in t-yosemite-r5-0001 is actually marginally better.
(In reply to Kim Moir [:kmoir] from comment #14)
> Regarding comment #11 and the graph server.  I checked the graphs database
> and talos data was entered into it for 10.10 machines. I added a patch in
> bug 1125853 to add it to graphs/data.sql

I meant to get back to that, and then forgot until your talos run reminded me of why I wasn't getting talos runs on try, so I started getting them again and got another of the same sort of failure.

https://treeherder.mozilla.org/logviewer.html#?job_id=4876389&repo=try has nothing to do with the graph server itself (which is, paradoxically, unusual for the "graph server unreachable" message, which is almost always not about being unreachable but instead about something more like "perfectly reachable but doesn't like us only providing two bits of data about something where we're supposed to provide three"), but is actually about it being unreachable because the slave tries to resolve graphs.mozilla.org and fails:

16:08:42 INFO - INFO : Posting result 0 of 5 to http://graphs.mozilla.org/server/collect.cgi, attempt 0
16:09:47 INFO - INFO : Posting result 0 of 5 to http://graphs.mozilla.org/server/collect.cgi, attempt 1
16:10:57 INFO - INFO : Posting result 0 of 5 to http://graphs.mozilla.org/server/collect.cgi, attempt 2
16:12:17 INFO - INFO : Posting result 0 of 5 to http://graphs.mozilla.org/server/collect.cgi, attempt 3
16:13:57 INFO - INFO : Posting result 0 of 5 to http://graphs.mozilla.org/server/collect.cgi, attempt 4
16:16:17 INFO - DEBUG : Working with test: a11yr
16:16:17 INFO - WARNING: graph server unreachable
16:16:17 INFO - WARNING: [Errno 8] nodename nor servname provided, or not known
16:16:17 INFO - WARNING: graph server unreachable
16:16:17 INFO - WARNING: [Errno 8] nodename nor servname provided, or not known
16:16:17 INFO - WARNING: graph server unreachable
16:16:17 INFO - WARNING: [Errno 8] nodename nor servname provided, or not known
16:16:17 INFO - WARNING: graph server unreachable
16:16:17 INFO - WARNING: [Errno 8] nodename nor servname provided, or not known
16:16:17 INFO - WARNING: graph server unreachable
16:16:17 INFO - WARNING: [Errno 8] nodename nor servname provided, or not known

so still 0001 and so far only 0001 intermittently fails at DNS, and then either continues to fail or caches the failure.
Retriggered some and got 0002 to fail the same way. I haven't personally had any DNS troubles with 10.10, but I've heard of plenty, and in theory the minor updates have all been trying to fix various troubles. Is our image something later than it started at, or is it still pre-release?
The image is the 10.10 base image, aka Darwin Kernel Version 14.0.0 (not pre-release). It seems strange that, if it's a DNS issue, it only intermittently fails to resolve graphs.mozilla.org and not other hosts, e.g. while cloning, etc.
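If it helps narrow it down, a throwaway probe one could leave running on the slave to see whether graphs.mozilla.org specifically fails to resolve; the hostname, attempt count, and delay are just illustrative, and socket.gaierror is what surfaces as the "[Errno 8] nodename nor servname provided, or not known" message:

# Throwaway DNS probe for the slave: resolve graphs.mozilla.org a few times
# and log any resolver errors. getaddrinfo failures on OS X are what show up
# as "[Errno 8] nodename nor servname provided, or not known".
import socket
import time

def probe(host="graphs.mozilla.org", attempts=5, delay=60):
    for i in range(attempts):
        try:
            addrs = sorted({a[4][0] for a in socket.getaddrinfo(host, 80)})
            print("attempt %d: %s -> %s" % (i, host, ", ".join(addrs)))
        except socket.gaierror as e:
            print("attempt %d: resolution failed: %s" % (i, e))
        time.sleep(delay)

if __name__ == "__main__":
    probe()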
Depends on: 1134111
Depends on: 1134790
Removing another round of still-open-but-already-worked-around things that no longer block.
No longer depends on: 1122478, 1122992, 1123085, 1123195, 1125479, 1128517, 1129300
philor, thank you so much for all your work getting all these tests greened up. They look so much better!

We intend to enable only opt tests on 10.10 on trunk for now, until bug 1125998 is resolved.

The remaining failures for opt tests on try are as follows:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d351255bed2e

-> I think we can ignore these because they aren't enabled on other branches:
Rev5 MacOSX Yosemite 10.10 try opt test web-platform-tests-reftests
Rev5 MacOSX Yosemite 10.10 try opt test web-platform-tests-1
Rev5 MacOSX Yosemite 10.10 try opt test web-platform-tests-2
Rev5 MacOSX Yosemite 10.10 try opt test web-platform-tests-3

mochitest-5
1688 INFO TEST-UNEXPECTED-FAIL | toolkit/components/passwordmgr/test/test_basic_form_observer_autofillForms.html | Test timed out. - expected PASS

mochitest-other
2771 INFO TEST-UNEXPECTED-FAIL | widget/tests/test_native_mouse_mac.xul | received event I didn't expect: {"type":"mouseout","target":"box","screenX":150,"screenY":150} - expected PASS
2772 INFO TEST-UNEXPECTED-FAIL | widget/tests/test_native_mouse_mac.xul | Didn't receive expected event: {"type":"mousemove","target":"box","screenX":170,"screenY":150} - expected PASS
2773 INFO TEST-UNEXPECTED-FAIL | widget/tests/test_native_mouse_mac.xul | received event I didn't expect: {"type":"mouseover","target":"box","screenX":170,"screenY":150} - expected PASS

Should these tests be disabled? Or is there a fix? I would like to pull some slaves from the pool and reimage them as 10.10 soon so we can move forward with 10.10 testing on trunk. (See bug 1134223 and bug 1126493)
Flags: needinfo?(philringnalda)
First guess on the m-oth is that t-yosemite-r5-0004 should be disabled until we clear up whatever system dialog it wound up with over the top of the browser after the reimage - we have three failures in that try run, but that's three failures all on t-yosemite-r5-0004.

test_basic_form_observer_autofillForms.html doesn't excite me much so far, I'm willing to call it a random as yet unfiled timeout unless it comes back more frequently.
Flags: needinfo?(philringnalda)
I agree with philor on mochitest-5, that is pretty random and is probably a low frequency intermittent.

Regarding the m-oth stuff, this is really an issue where we are failing a Mac-specific test. This should be fixed sooner rather than later, and what philor indicates about reimaging, etc. is good insight!

Right now we are overlooking debug m3.  That is failing and most likely due to the slow runtime.  Can we either not run debug m3 or schedule it and hide it by default as it will most likely fail all the time?
Joel, we are only enabling 10.10 opt tests on trunk, so we can continue to look at this m3 debug failure on try. This is because of the issue in bug 1125998 where debug tests take so long.
ah, I get it -- points for not reading everything!

ok, let's move forward; it sounds like we just have a few tests to disable in the oth tests temporarily.
No, we don't have tests to disable, we have reimage-related infra bustage to clean up. There are things we can just sweep under the carpet, and believe me I've swept a lot under it, but "some and only some slaves have something getting in the way of mouseover" is not something we can just sweep away. As the reimaged slaves come up, we need to run m-oth on them, on try or on a dev master, and only make the changeover using those ones which can demonstrate they don't have something crawling out over the top of the browser during tests.
Sigh, even less fun, it's not consistent, 0002 passed a run and then failed the next.

Could we be restoring apps after reboot, or putting up system crash report dialogs? We've done both before, and wound up with intermittent things-over-the-browser as a result.
Or, I could be certain it's infra and wrong about it being infra, we've done that plenty of times too.

Retriggered rather a lot on https://treeherder.mozilla.org/#/jobs?repo=try&revision=83c654620c54 where the test_platform_colors.xul failure is expected but everything else should be healthy.
Yes, I noticed that

m-other opt fails on
yosemite slaves 4, 2, 1, 5

and passes on 3 and 2

So it seems strange that the test passes and fails on the same host (yosemite 2) but at different times.

We are not restoring apps after reboot.
I haven't seen system crash dialogs appear. Not sure how they are explicitly disabled at the OS level.
for buildduty: I reverted the patch to enable 10.10 tests on trunk since philor (bug 1126493) is running tests to try to find out what is causing the mochitest-other failures. So 10.10 won't be enabled on trunk if there is a reconfig. We are down 23 minis for 10.8, but the pending counts don't look that bad for tests; they are actually much worse for builds.
Depends on: 1137575
Turns out I was wrong about it being a valued test, so you just need to land https://hg.mozilla.org/try/rev/4d66cd3d5498 everywhere, and move the goalposts on Wr and W(1,3), which actually are enabled everywhere now since he thought he was done greening them up, but surprise, his 10.8 ones are going away and his 10.10 ones will be hidden.

On the bright side, I noticed this evening that I haven't actually seen a talos DNS failure on 10.10.2 yet. Triggered several rounds on https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ca60aa1b131 to see if that's just "yet."
(In reply to Phil Ringnalda (:philor) from comment #30)

Did we determine whether or not this issue was specific to this host (and if so, have we tried a basic reimage)? This is one of the used ones we just purchased, so if we're having issues with it, we want to determine that fairly quickly so we can get it repaired/returned under warranty.
All we've determined so far is that we're willing to throw the test under the bus, but it doesn't appear specific to, or even more than very vaguely more common on, a particular slave. Given three failures all on one slave, and one pass on a different slave, 'that slave is broken' seems like a fair bet, but 'that sample size was insufficient' is the only really safe bet.
Depends on: 1129771
Looks like recent debug test runs on try are green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e8751506ae2

so I'm going to ask relops to reimage some more machines  (bug 1140246) so we can enable debug on trunk
I think we can close this bug.  Thanks so much philor for all your work greening up tests!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → kmoir