Closed Bug 1210395 Opened 9 years ago Closed 9 years ago

"Green up" tests on OS X 10.10.5

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Unassigned)

Added several t-yosemite-r7 slaves to slavealloc, created and configured a test master on dev-master2.bb.releng.use1.mozilla.com and invoked sendchanges for several changesets from mozilla-central. 

The issue here is that some jobs still end up with warnings. Most of them are web-platform tests (opt+debug), but there are also some mochitest and marionette jobs.
Jobs that finish with warnings:

	- both opt&debug
		test web-platform-tests-2
		test web-platform-tests-3
		test web-platform-tests-reftests
		test mochitest-devtools-chrome-2
		test mochitest-gl
	- opt: 
		test web-platform-tests-4
	- debug:
		test web-platform-tests-5
		test web-platform-tests-7
		test marionette		

Talked to Coop about this and we suspect that the issue might be related to moving from one OS version to another (10.10.2 --> 10.10.5). 

:jmaher we would appreciate it if you could take a look into this. Thanks!
Flags: needinfo?(jmaher)
can you send me a link to try or another branch where I can view these?  Ideally try is the best so we can test fixes, retrigger to see patterns, etc.

I won't be able to fix everything- this is usually a 100-200 hour project, if it is less, I will make a bigger dent.
Flags: needinfo?(jmaher)
We know that openssl no longer allows SSLv3 connections, and that the cert checking stuff in python 2.7.10 has been tightened up. I'm wondering if some of the failures are due to one/both of these.

We were comparing two jobs, the first on an r5 that passed:

http://buildbot-master107.bb.releng.scl3.mozilla.com:8201/builders/Rev5%20MacOSX%20Yosemite%2010.10%20mozilla-central%20opt%20test%20web-platform-tests-2/builds/403/steps/run_script/logs/stdio

The second on an r7 that didn't (same changeset, same software):
http://dev-master2.bb.releng.use1.mozilla.com:8095/builders/Rev5%20MacOSX%20Yosemite%2010.10%20mozilla-central%20opt%20test%20web-platform-tests-2/builds/2/steps/run_script/logs/stdio

In this case and one we looked at before, the r7 does a runner_teardown and gives a WARNING around line 1363, but the r5 does not. In all other instances, a runner_teardown only happens if the test times out or if the end of the test run is reached. I'm wondering if this is significant (why is it doing a teardown here in the middle of the suite, after a successful test?)
Flags: needinfo?(winter2718)
keep in mind this could be an intermittent issue- I really don't see this on mozilla-inbound in the last 24 hours.  if you look in the log there is a 'test-unexpected-fail', this means that we have failures and need to address them unless they are intermittent.

looking at the failure, I see:
https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/meta/fetch/nosniff/image.html.ini?offset=400

note the specific hardcoding for osx 10.10.2; it sounds like we need to add 10.10.5 in there. There are 412 instances:
https://dxr.mozilla.org/mozilla-central/search?q=10.10.2&redirect=true&case=true&limit=51&offset=351

There is a tool to do this automatically for web platform tests specifically. If we push to try for all platforms and run ALL the web platform tests, then we can use:
* testing/web-platform/update/fetchlogs.py
* ./mach web-platform-tests-update /path/to/logs/*.log
* look at the diff of files, publish the diff for review, land, profit.

for mochitest/reftest/etc. it will be a different story.  I think the key is that we get these on try so we can look for patterns and use other tools.
Right, with wpt it's important to remember that there is metadata that determines the expected result and that this metadata is os version specific. So when adding a new platform we need to add new metadata.
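
Before running the update tooling, it can help to see how much metadata is pinned to the old OS version. The sketch below is only an illustration: `find_version_pins` is a hypothetical helper, not part of the wpt tooling, and it does a plain substring search over `.ini` files rather than parsing the metadata format.

```python
import os


def find_version_pins(root, needle="10.10.2"):
    """Return {path: [line numbers]} for .ini files under root that contain needle.

    Hypothetical helper for illustration; plain substring match, no parsing.
    """
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".ini"):
                continue  # only metadata-style files are of interest here
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                lines = [n for n, text in enumerate(handle, start=1) if needle in text]
            if lines:
                hits[path] = lines
    return hits
```

Calling `find_version_pins("testing/web-platform/meta")` from a checkout would list the files that mention 10.10.2, assuming the metadata lives under that directory as the dxr link above suggests.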
(In reply to Amy Rich [:arr] [:arich] from comment #3)
> We know that openssl no longer allows SSLv3 connections, and that the cert
> checking stuff in python 2.7.10 has been tightened up. I'm wondering if some
> of the failures are due to one/both of these.
> 
> We were comparing two jobs, the first on an r5 that passed:
> 
> http://buildbot-master107.bb.releng.scl3.mozilla.com:8201/builders/
> Rev5%20MacOSX%20Yosemite%2010.10%20mozilla-central%20opt%20test%20web-
> platform-tests-2/builds/403/steps/run_script/logs/stdio
> 
> The second on an r7 that didn't (same changeset, same software):
> http://dev-master2.bb.releng.use1.mozilla.com:8095/builders/
> Rev5%20MacOSX%20Yosemite%2010.10%20mozilla-central%20opt%20test%20web-
> platform-tests-2/builds/2/steps/run_script/logs/stdio
> 
> In this case and one we looked at before, the r7 does a runner_teardown and
> gives a WARNING around line 1363, but the r5 does not. In all other
> instances, a runner_teardown only happens if the test times out or if the
> end of the test run is reached. I'm wondering if this is significant (why is
> it doing a teardown here in the middle of the suite, after a successful
> test?)

I'm not savvy on the way the test runners work. Maybe it uses different test runners for different test suites? I'm out of my depth on this one.
Flags: needinfo?(winter2718)
:kmoir pushed some try tests on yosemite-r7, more details:
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=a479174a8763
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=0084ba8c2789

After the tests were completed, I made a comparison between them:
Failing jobs:
-> mozilla-central opt test gtest
-> mozilla-central debug test gtest
Warning jobs:
-> mozilla-central opt test mochitest-devtools-chrome-2
-> mozilla-central opt test mochitest-gl
-> mozilla-central opt test web-platform-tests-2
-> mozilla-central opt test web-platform-tests-3
-> mozilla-central opt test web-platform-tests-4
-> mozilla-central opt test web-platform-tests-reftests
-> mozilla-central debug test mochitest-2
-> mozilla-central debug test mochitest-devtools-chrome-3
-> mozilla-central debug test mochitest-gl
-> mozilla-central debug test web-platform-tests-2
-> mozilla-central debug test web-platform-tests-3
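
To check which warnings carried over from the earlier test-master run (the list in the bug description) to these try pushes, the two lists can be compared as sets. This is a sketch that uses only the job names quoted in this bug; the variable names are made up for illustration.

```python
# Warning jobs from the test-master run in the bug description, as (build, suite) pairs.
both_builds = [
    "web-platform-tests-2",
    "web-platform-tests-3",
    "web-platform-tests-reftests",
    "mochitest-devtools-chrome-2",
    "mochitest-gl",
]
master_warnings = {(build, suite) for build in ("opt", "debug") for suite in both_builds}
master_warnings |= {("opt", "web-platform-tests-4")}
master_warnings |= {
    ("debug", "web-platform-tests-5"),
    ("debug", "web-platform-tests-7"),
    ("debug", "marionette"),
}

# Warning jobs from the try pushes listed above.
try_warnings = {
    ("opt", "mochitest-devtools-chrome-2"),
    ("opt", "mochitest-gl"),
    ("opt", "web-platform-tests-2"),
    ("opt", "web-platform-tests-3"),
    ("opt", "web-platform-tests-4"),
    ("opt", "web-platform-tests-reftests"),
    ("debug", "mochitest-2"),
    ("debug", "mochitest-devtools-chrome-3"),
    ("debug", "mochitest-gl"),
    ("debug", "web-platform-tests-2"),
    ("debug", "web-platform-tests-3"),
}

still_warning = master_warnings & try_warnings   # warned in both runs
only_on_try = try_warnings - master_warnings     # new in the try list
only_on_master = master_warnings - try_warnings  # absent from the try list

for label, jobs in [
    ("both", still_warning),
    ("try only", only_on_try),
    ("test master only", only_on_master),
]:
    print(label, sorted(jobs))
```

A job that appears only in the test-master list did not necessarily turn green; it may simply be intermittent or not have run on the try push.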
gtest is 100% failing on every push on every tree on every OS, so you don't have to worry about that.
So I just got mass failures on a try push because of this. I don't understand why we're pushing these machines out to try rather than creating a pool just for some twig and using that to green them up.
That would be bug 1203128 rather than this, but the answer is that it didn't expect that to happen, it thought that it was requiring that you use "-u web-platform-tests[Ubuntu,10.8,10.10.5,Windows XP,Windows 7,Windows 8]" to wind up with the 10.10.5 slaves, and thought https://treeherder.mozilla.org/#/jobs?repo=try&revision=33bc0551337f would still get the 10.10 slaves.

("10.8"? Did you mean 10.6?)
(In reply to Vlad Ciobancai [:vladC] from comment #7)
> :kmoir pushed some try tests on yosemite-r7, more details:
> - https://treeherder.mozilla.org/#/jobs?repo=try&revision=a479174a8763
> - https://treeherder.mozilla.org/#/jobs?repo=try&revision=0084ba8c2789

Trychooser syntax needs to be "try: -b do -p macosx64 -u all[10.10.5] -t all[10.10.5]" with the doubled 10.10.5 that the website won't add to it for you, in order to get talos.
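
Since the OS filter has to be repeated on both `-u` and `-t`, a small helper can make that hard to forget. This is a minimal sketch; `build_try_syntax` is a hypothetical function, not part of trychooser, and it only formats the commit-message string quoted above.

```python
def build_try_syntax(platform, os_filter, build="do", unittests="all", talos="all"):
    """Compose a try syntax string with the same OS filter on both -u and -t.

    Hypothetical helper: it only does string formatting and knows nothing
    about which platforms or filters the try server actually accepts.
    """
    return (
        f"try: -b {build} -p {platform} "
        f"-u {unittests}[{os_filter}] -t {talos}[{os_filter}]"
    )


print(build_try_syntax("macosx64", "10.10.5"))
```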
Looks like there's a bug disabling 10.10.5 by default when 10.10 is explicitly specified. I've opened bug 1212887 for this.
Blocks: 1212887
from #ateam today

jmaher	jgraham: btw, did you get the osx 10.10.5 data from your try run?
	jgraham	jmaher: Yeah, but https://github.com/jgraham/treeherder_timeline was supposed to be a sea of green
	jgraham	Uh
	jgraham	https://treeherder.mozilla.org/#/jobs?repo=try&revision=9c325655c676
	jmaher	jgraham: looks like it isn't all green- is there anything I can do to help?
	jgraham	jmaher: Make try runs faster?
	jgraham	But I will fix up some of the remaining issues and see what another cycle brings
	jgraham	But now I have to leave
	kmoir	jgraham: it looks like some of these tests ran on r5 machines instead of r7
	jmaher	oh, -u [10.10], not -u [10.10.5]
	jgraham	Wait, did we fix that bug?
	jgraham	Last time [10.10] gave me 10.10.5
	kmoir	I fixed that bug on Friday
	kmoir	please specify 10.10.5
	jgraham	Oh, OK
	jmaher	jgraham: could I hack on the tool to not require windows?  maybe that would help speed things up
	jgraham	Thanks for fixing the bug
Today I ran a try push on the new yosemite-r7 slaves; more details: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b4a05498515

jmaher, jgraham: did you manage to find a resolution for the web platform tests?
Flags: needinfo?(jmaher)
Flags: needinfo?(james)
we are not there yet- the long turnaround on try (windows) means we context switch off and finally get around to it much later than optimal.  I recall that James was updating some tests and had to fix/remove those before getting to green- but 10.10.2 and 10.10.5 are looking very similar on the latest try push.
Flags: needinfo?(jmaher)
Yeah, I think I have "solved" the 10.10.5 problem, but it's mixed in with a pull of those tests from upstream which has introduced some issues. So I am working through those now. https://treeherder.mozilla.org/#/jobs?repo=try&revision=d009e6c5e85f was my last try push, but note that Wr on the base commit was bad, so there is less real orange than it seems.
Flags: needinfo?(james)
Blocks: 1184181
all done here.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
jmaher, is there a way to uplift the patches from bug 1216542, bug 1216549, bug 1216551, and bug 1223372 to non-trunk branches?  We would like to totally decommission the r5 10.10 test machines.
Flags: needinfo?(jmaher)
I would be concerned about the patch in bug 1216542 as it touches the js source.  It was only necessary for code on trunk, so we could uplift the others and give it a try- not sure how to really try it out.  This could be done for Aurora, then the code could uplift to beta;  we still have mozilla-release and esr, likewise old school b2g branches- I assume those all run osx jobs.
Flags: needinfo?(jmaher)
I mistyped the bug; it is bug 1223372 which was js specific and related to a patch on trunk.  I have uplifted the other 3 patches, from bug 1216542, bug 1216549, and bug 1216551.
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard