Open Bug 1308472 Opened 6 years ago Updated 2 months ago

Intermittent 'make -k check' did not run successfully. Please check log for errors.

Categories

(Firefox Build System :: General, defect)

defect

Tracking

(Not tracked)

People

(Reporter: aryx, Unassigned)

References

Details

(Keywords: intermittent-failure)

+++ This bug was initially created as a clone of Bug #1305227 +++

+++ This bug was initially created as a clone of Bug #1304593 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=37243123&repo=mozilla-inbound

First occurrence seems to be https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=4531646

15:53:24     INFO - mozmake[3]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser/tools/mozscreenshots/mozscreenshots/extension'
15:53:24     INFO - mozmake[2]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser/tools/mozscreenshots'
15:53:24     INFO - mozmake[1]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser'
15:53:24    ERROR - Return code: 2
15:53:24     INFO - TinderboxPrint: check<br/>6167/0/0
15:53:24  WARNING - setting return code to 2
15:53:24    ERROR - 'make -k check' did not run successfully. Please check log for errors.
This looks like a make failure -- I don't see any previous errors in the logfile before mozmake returned 2.

Unless Rob or Pete recognize this as an issue with the new builders, I'd send this over to the build team?
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
there's been no recent changes to the builders but i note the referenced log file is for a pgo build which is something we've only been doing since bug 1304998 so if that's where all the failures are, it may be a problem with pgo configs.
Flags: needinfo?(rthijssen)
  https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308472&tree=all

The OS X failure is mis-starred.  But the rest are win32 pgo and win32 opt.

All I can find for ERRORs is:

08:36:59    ERROR - Return code: 1
09:34:05    ERROR - Return code: 2
09:34:05    ERROR - 'make -k check' did not run successfully. Please check log for errors.

so basically it looks like make is bombing out on its own, rather than reflecting an error from something it invoked.  At any rate, this is beyond my ability to debug with this limited information.
Component: General → Build Config
Flags: needinfo?(pmoore)
Product: Taskcluster → Core
Looking at one of those logs:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=37243123#L40422

The log parser isn't finding some of the errors. Specifically, there's stuff like:
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-unittest=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-unittest-level=error
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-html=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-html-level=error
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-tbpl=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-tbpl-level=error
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-xunit=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-xunit-level=error
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-errorsummary=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-errorsummary-level=error
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-mach=-
12:15:12     INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12     INFO - test.py: error: unrecognized arguments: --log-mach-level=error
12:15:12     INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpqpdska') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034503D0>> ignored
12:15:12     INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpro5zzh') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CADB0>> ignored
12:15:12     INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmp9jlr4s') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CA3D0>> ignored
12:15:12     INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpq86gzl') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CA830>> ignored
<...>
12:15:24     INFO - z:/task_1475834533/build/src/testing/testsuite-targets.mk:340: recipe for target 'check' failed
12:15:24     INFO - mozmake: *** [check] Error 1

Some of this might be fallout from bug 1304593.
Or bug 1304593 didn't fix the entire intermittent as "make -k check return code 2" was happening over there too. But these look like errors in the mozbase tests, so could be differences in how it's run. Given this is intermittent, I'm guessing the "unrecognized arguments" errors are not the problem (though should likely be fixed anyway). It's the WindowsError at the bottom that is causing the orange.
it's occurred to me that some of the race conditions and file locks we see in windows builds (like this one) might be due to problems with performance on the c: drive (because of the first-read ebs problems: bug 1305174). often when this comes up, the file in question is a temp file on the c: drive. whereas the build working directory for tc win builds, is on the faster ephemeral volumes (z:).

if this is the cause (or perhaps one of the causes) we could possibly address it in a couple different ways. we already have bug 1306989 against the generic worker to set the values of the temp environment variables to paths on z: but it's also possible to make the change in-tree with something like this: https://hg.mozilla.org/try/diff/48a500082516/taskcluster/taskgraph/transforms/job/mozharness.py which just sets the vars and creates the temp folders for the task user before the build starts.
I've now seen several failure logs from jobs where temp data was not on the c: drive, so I no longer think my theory in comment 7 is worth exploring.
:grenade, are you investigating this?  94 occurrences last week seems like a lot.
Flags: needinfo?(rthijssen)
i'm not looking at this. if it had been pgo only, then it could have been explained by a problem with pgo config, which i recently changed, but for make, we'll need build team expertise.
Flags: needinfo?(rthijssen)
this is an error setting up hg:
 13:49:45     INFO - [mozharness: 2016-10-15 13:49:45.751000Z] Skipping clobber step.

 13:49:45     INFO - [mozharness: 2016-10-15 13:49:45.751000Z] Running clone-tools step.

 13:49:45     INFO - Running pre-action listener: influxdb_recording_pre_action

 13:49:45     INFO - Running main action method: clone_tools

 13:49:45     INFO - retry: Calling _get_revision with args: (<mozharness.base.vcs.mercurial.MercurialVCS object at 0x0315AEB0>, 'z:\\task_1476538045\\build\\tools'), kwargs: {}, attempt #1

 13:49:45  WARNING - did not specify revision or branch; assuming "default"

 13:49:45     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--version']

 13:49:45     INFO - Copy/paste: hg --config ui.merge=internal:merge --version

 13:49:45     INFO -  Mercurial Distributed SCM (version 3.7.3)

 13:49:45     INFO -  (see https://mercurial-scm.org for more information)

 13:49:45     INFO -  Copyright (C) 2005-2016 Matt Mackall and others

 13:49:45     INFO -  This is free software; see the source for copying conditions. There is NO

 13:49:45     INFO -  warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

 13:49:45     INFO - Return code: 0

 13:49:45     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', 'debuginstall']

 13:49:45     INFO - Copy/paste: hg --config ui.merge=internal:merge debuginstall

 13:49:46     INFO -  checking encoding (cp1252)...

 13:49:46     INFO -  checking Python executable (C:\mozilla-build\python\Scripts\hg.exe)

 13:49:46     INFO -  checking Python version (2.7.11)

 13:49:46     INFO -  checking Python lib (C:\mozilla-build\python\lib)...

 13:49:46     INFO -  checking installed modules (C:\mozilla-build\python\Lib\site-packages\mercurial)...

 13:49:46     INFO -  checking templates (C:\mozilla-build\python\Lib\site-packages\mercurial\templates)...

 13:49:46     INFO -  checking commit editor...

 13:49:46     INFO -   No commit editor set and can't find vi in PATH

 13:49:46     INFO -   (specify a commit editor in your configuration file)

 13:49:46     INFO -  checking username...

 13:49:46     INFO -   no username supplied

 13:49:46     INFO -   (specify a username in your configuration file)

 13:49:46     INFO -  1 problems detected, please check your install!

 13:49:46    ERROR - Return code: 1

 13:49:46     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1476538045\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/build/tools', 'z:\\task_1476538045\\build\\tools', '--sharebase', 'C:\\builds\\hg-shared', '--branch', 'default']

 13:49:46     INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1476538045\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/build/tools z:\task_1476538045\build\tools --sharebase C:\builds\hg-shared --branch default

 13:49:55     INFO -  

 13:49:55     INFO - clone [=============================>                  ] 18057764/28698371 02s

 13:49:55     INFO - clone [=========================================>      ] 25260729/28698371 01s

 13:49:55     INFO -                                                                                

 13:49:55     INFO - ensuring https://hg.mozilla.org/build/tools@default is available at z:\task_1476538045\build\tools

 13:49:55     INFO -  (sharing from new pooled repository 7ae7fb134bf7aec6ec96a062ff47a69053dd2973)

 13:49:55     INFO -  applying clone bundle from https://s3-us-west-1.amazonaws.com/moz-hg-bundles-us-west-1/build/tools/78748858da198ca43c54765522a91ecebee0e075.packed1.hg

 13:49:55     INFO -  2021 files to transfer, 27.4 MB of data

 13:49:55     INFO -  transferred 27.4 MB in 4.1 seconds (6.66 MB/sec)

 13:49:55     INFO -  finished applying clone bundle

 13:49:55     INFO -  searching for changes

 13:49:55     INFO -  no changes found

 13:49:55     INFO -  searching for changes

 13:49:55     INFO -  no changes found

 13:49:55     INFO -  (pulling to obtain default)

 13:49:55     INFO -  (remote resolved default to 78748858da198ca43c54765522a91ecebee0e075; result is not deterministic)

13:49:55 INFO - (revision already present locally; not pulling) 


David, can you find a build peer to look at this?  I suspect it isn't build related but taskcluster related given the fact that this only fails on taskcluster and not on buildbot.
Flags: needinfo?(dburns)
Greg, since you have been working on TC lately, and your HG knowledge is great, can you have a quick look.
Flags: needinfo?(dburns) → needinfo?(gps)
i had no idea that the make error was hg related, but if it is, and seeing an hg 3.7.3 invocation in the log in comment 16, i would like to add here that there are two versions of hg on tc builders:
- 3.7.3 installed with mozilla-build at c:\mozilla-build\python\scripts\hg.exe
- 3.9.1 installed with mercurial inno-setup at c:\Program Files\Mercurial\hg.exe

currently the older version has precedence in the PATH environment variable and will be used if no explicit path to hg is provided in the hg command.
I don't think the hg error is related -- that's always occurred, and is only an issue when creating a commit.  But greg will know more.
The exceptions from comment #4 seem like the real intermittent failure. Possibly a poorly written test. I'm not sure what file/test they are occurring in, however.
Those are also red herrings, that sequence of errors show up in every log (see comment 8 for example).
I am unclear of the next steps here to resolve this- it appears I found an error in the logs which isn't an error (not sure why we don't fix that?) but the real failure is in something called test.py which from what I can figure out is related to mozbase unittests?  why are these part of the build?

it appears test.py is here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/test.py

we run a series of tests defined here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/test-manifest.ini

and it seems that we make it to the end of the tests (test_apk.ApkTest.test_with_package_name).

all of this is run from the command:
z:/task_1475834533/build/src/obj-firefox\_virtualenv\Scripts\python.exe z:/task_1475834533/build/src/testing/mozbase/test.py

:ahal, I see you reviewed a patch for pylint which touched a lot of mozbase code recently (possibly it landed a few days prior to this issue showing up):
https://hg.mozilla.org/mozilla-central/rev/f7b6a16eb1f6
Flags: needinfo?(ahalberstadt)
No, the mozbase linting landed well after this was filed. I could be wrong, but I don't think this is related to test.py. What makes you say that's where the real error is?
Flags: needinfo?(ahalberstadt)
comment 4 seems to be what ted and gps think is the root cause, which is test.py.

this bug is a bit silly since so many people have looked at the failure logs and cannot determine the root cause of the failure.

looking at what is failing, this is taskcluster windows builds only, not buildbot- If we cannot figure out the root cause, that is the place to start looking.
As I've said before, comment 4 happens in every single job, even green ones. Here's an arbitrary successful windows taskcluster build:
https://treeherder.mozilla.org/logviewer.html#?job_id=3656791&repo=autoland#L0-L34278

Ctrl-F for the error in comment 4 and you'll find it :). I believe the true cause of the error is not showing up in the logs.
got it, I thought that was in reference to the hg error I found.

going on the theory that we cannot find the root cause seems to be a valid summary.

Since this is build related, I would like :automatedtester to help find a build peer to find the root cause.
Flags: needinfo?(dburns)
clearing the n-i since there already is a n-i on a build peer
Flags: needinfo?(dburns)
There are now osx and linux failures too!

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1479887795/mozilla-inbound-macosx64-debug-bm86-build1-build6103.txt.gz

01:37:13     INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/python/mozbuild/mozbuild/test/configure/test_toolchain_configure.py | WindowsToolchainTest.test_unsupported_msvc
01:37:43     INFO - /builds/slave/m-in-m64-d-0000000000000000000/build/src/build/compare-mozconfig/compare-mozconfigs-wrapper.py
01:37:43     INFO - WARNING:__main__:if test `uname -s` != Linux; then not in macosx-universal [u'ac_add_options --enable-update-channel=nightly', u'ac_add_options --with-branding=browser/branding/nightly', u'ac_add_options --enable-profiling', u'mk_add_options CLIENT_PY_ARGS="--hg-options=\'--verbose --time\' --hgtool=../tools/buildfarm/utils/hgtool.py --skip-chatzilla --skip-comm --skip-inspector --tinderbox-print"', u'mk_add_options MOZ_MAKE_FLAGS="-j4"', u'if test "${MOZ_UPDATE_CHANNEL}" = "nightly"; then', u'ac_add_options --with-macbundlename-prefix=Firefox', u'fi', u'mk_add_options MOZ_MAKE_FLAGS="-j12"', u'ac_add_options --with-ccache', u'. "$topsrcdir/build/mozconfig.cache"', u'ac_add_options --disable-install-strip', u'ac_add_options --enable-instruments', u'ac_add_options --enable-dtrace']!
01:37:43     INFO - ERROR:__main__:found in /builds/slave/m-in-m64-d-0000000000000000000/build/src/browser/config/mozconfigs/macosx-universal/nightly but not in /builds/slave/m-in-m64-d-0000000000000000000/build/src/browser/config/mozconfigs/macosx-universal/beta: if test `uname -s` != Linux; then
01:37:43     INFO - ERROR:root:Mozconfig check failed!
01:37:43     INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/build/compare-mozconfig/compare-mozconfigs-wrapper.py | TestCompareMozconfigs.test_compare_mozconfigs
01:37:43     INFO - /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py
01:37:43     INFO - WARNING: Not a supported OS_TARGET for NSPR in moz.build: "". Use --with-system-nspr
01:37:43     INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_filesystem_traversal_no_config
01:37:43     INFO - TEST-SKIP | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_filesystem_traversal_reading
01:37:43     INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_orphan_file_patterns
01:37:43     INFO - make: *** [check] Error 245
And on linux, 

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6952708#L20535

19:21:02 INFO - PROCESS-CRASH | test_assert.js | application crashed [@ xpcshell + 0x4b2a]
:mshal - The failure in comment 31 looks like a regression from bug 1183613; can you have a look?
Flags: needinfo?(mshal)
See Also: → 1318597
It looks like there are a few things going on here.

1) The compare-mozconfigs check fails
 - I introduced the OSX one in bug 1183613 as :gbrown noted
 - The Linux one is from bug 1278445 (back in July)

2) 'make check' still returns success even when compare-mozconfigs fails, which is why the errors above went unnoticed.
 - This was partly fixed I think by bug 1255479, but it missed one of the failure codepaths. The 'return ret_code' in compare-mozconfigs-wrapper.py should be an assert.

3) 'make -k check' randomly fails with strange error codes for no discernible reason (aka: this bug and bug 1318597). I still don't know the reason for this, nor can I reproduce it locally.

I'll file separate bugs for 1) & 2) and fix those, but I don't think that alone will fix this issue.
Flags: needinfo?(mshal)
See Also: → 1320761, 1320762
it has been a few weeks, this still is continuing- :mshal can you have a fresh look at this?
Flags: needinfo?(mshal)
Sure, I'll take another look. I haven't had much luck reproducing it locally yet, but I'm still trying. If not, the next best thing may be to just add some more debugging to see if we can get an actual error message to pinpoint the problem.
Flags: needinfo?(mshal)
I still haven't had any success in reproducing this locally. I'll file a bug to add some debugging, and hopefully on the next report we'll have some logs that give additional clues.
Depends on: 1325197
I did finally hit this locally, though it took a couple hundred runs. I had the patch in bug 1325197 applied, and saw this:

============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-2.9.2, py-1.4.31, pluggy-0.3.1 -- /home/marf/mozilla-central-git/obj-x86_64-pc-linux-gnu/_virtualenv/bin/python
cachedir: ../python/mozlint/.cache
rootdir: /home/marf/mozilla-central-git/python/mozlint, inifile:
collecting ... collected 5 items

../python/mozlint/test/test_types.py::test_linter_types[string.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[regex.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[external.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[structured.lint] PASSED
../python/mozlint/test/test_types.py::test_no_filter PASSED

=========================== 5 passed in 0.16 seconds ===========================
Setting retcode to -139 from /home/marf/mozilla-central-git/python/mozlint/test/test_types.py


Which looks like test_types.py ran successfully, but then somehow the test harness gets a return code of -139. I was able to run test_types.py several thousand times successfully in isolation, so I'm not certain that this specific test is to blame. Maybe there's some funky interaction with running these in parallel.

In any case, if I hit it again I'll post more info.
Locally I've now hit this both in test_types.py and test_roller.py, which seems to match the logs since bug 1325197 landed. Specifically, test_roller.py failed in 7 cases and test_types.py failed in 13 cases.

I'm not sure if it's relevant, but these are 2 (out of 4) files that use pytest (instead of mozunit). In all cases the file that ends up setting the bad return code reports that all tests ran successfully, as in #c42.
On unix systems the return code comes from mozprocess here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#609

I think there are several issues being rolled up into this bug, because clicking a bunch of the windows failures in orangefactor show that make check fails before the python tests even get a chance to run. We should probably make those log lines from bug 1325197 into errors if the return code is non-zero, then we could make a new bug and star those failures over there.
Yeah, unfortunately the error message in this bug catches anything that causes 'make -k check' to fail. The ones I'm looking at specifically are the ones where we got no error message aside from make reporting the error, which now look to be specific to test_roller.py and test_types.py. Bug 1318597 catches these a lot better, but is specific to OSX since the error codes differ on other platforms.

I've been adding some more debug to try to figure out where the error code goes from 0 to something non-zero (since the test cases look like they pass). I also made a wrapper for the virtualenv python so I could see what the python subprocess itself was returning.

In one instance where I hit a test_types.py failure, it showed the return code as 0 right before calling sys.exit(rc). The python wrapper shell script then reports that the python process segfaults, and has a return value of 139 (on Linux - this varies per platform). So it looks like we're doing something funky in these two tests that triggers a rare python bug during cleanup.

ahal, any idea what test_types.py / test_roller.py would be doing differently than the other two mozlint tests? I still haven't seen a failure in test_formatters.py or test_parser.py
Flags: needinfo?(ahalberstadt)
Huh, that is bizarre. But I think your assessment might be correct. Both test_types.py and test_roller.py create a LintRoller() instance, whereas test_formatters.py and test_parser.py don't.

The LintRoller() object in turn uses the multiprocessing stdlib, so that must be related. I'll put aside some time this week to dig into this further.
Flags: needinfo?(ahalberstadt)
First, for some reason the ERROR log isn't showing up in treeherder as an error:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=69175359&lineNumber=16851

If we can get treeherder to highlight that line, it will be a lot easier to file a bug that can catch/isolate this error from the other stuff.

Second, I suspect that the multiprocessing manager is somehow involved:
https://dxr.mozilla.org/mozilla-central/source/python/mozlint/mozlint/roller.py#127

Mostly because we are segfaulting, and the manager is responsible for handling shared memory access. The only thing I can think to try is to explicitly call manager.shutdown() at the end (it is supposed to shutdown() when gc'ed though). I plan to move the mozlint tests into their own separate task once bug 1331663 is completed.. I'm tempted to wait until that happens to make testing this fix on try easier (no need to rebuild hundreds of times).

Alternatively, we could attempt to push that "fix" blindly and see if it works.
Flags: needinfo?(gps)
Where's the treeherder logic that determines what is an error? I'm totally find with just changing that message to match a filter better.

Though note that you'll see that error message if any of the python tests fail, so you'd have to specifically look for test_roller.py / test_types.py in order to better organize the bug reporting. The "Error 245" one seems to get the OSX ones very consistently at least.
(In reply to Michael Shal [:mshal] from comment #53)
> Where's the treeherder logic that determines what is an error? I'm totally
> find with just changing that message to match a filter better.

https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/parsers.py
I think the bug is that we *are* logging that at the logging.ERROR level, but by the time it gets to the log, it's only an INFO. There may be other error messages scattered around the tree being hidden this way too.

Eventually I'd also like to use mozlog there instead of mach's logging mechanism, but I have no timetable for that yet.
I looked at the 13 failures in #c62 to see if there has been any progress. We may want to narrow this bug to specifically track the test_types.py and test_roller.py failures that appear to succeed and then return failure. We get lots of false positives just by looking for "'make -k check' did not run successfully":

 - 8 instances of bug 1347414
 - 1 instance of bug 1349943
 - 1 seems to be an infra failure: make[2]: *** .deps/udata.o.pp: Input/output error.  Stop.
 - 1 appears to be mis-starred (it's a reftest failure, not a make check failure)

The two remaining are the failures that I think prompted this bug, namely that test_roller.py and/or test_types.py fail without any indication why. One of these was a linux64-asan build, and the other was windows2012-32-debug. The latter was more interesting, since it actually managed to produce some manner of stack trace:

08:59:11     INFO - ..\python\mozlint\test\test_types.py::test_linter_types[string.lint] PASSED
08:59:11     INFO - ..\python\mozlint\test\test_types.py::test_linter_types[regex.lint] PASSED
08:59:11     INFO - ..\python\mozlint\test\test_types.py::test_linter_types[external.lint] PASSED
08:59:11     INFO - ..\python\mozlint\test\test_types.py::test_linter_types[structured.lint] PASSED
08:59:11     INFO - ..\python\mozlint\test\test_types.py::test_no_filter FAILED
08:59:11     INFO - ================================== FAILURES ===================================
08:59:11     INFO - _______________________________ test_no_filter ________________________________
08:59:11     INFO - lint = <mozlint.roller.LintRoller object at 0x032E5150>
08:59:11     INFO - lintdir = 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\linters'
08:59:11     INFO - files = ['z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\foobar.js', 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\foobar.py', 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\no_foobar.js']
08:59:11     INFO -     def test_no_filter(lint, lintdir, files):
08:59:11     INFO -         lint.read(os.path.join(lintdir, 'explicit_path.lint'))
08:59:11     INFO -         result = lint.roll(files)
08:59:11     INFO -         assert len(result) == 0
08:59:11     INFO -         lint.lintargs['use_filters'] = False
08:59:11     INFO - >       result = lint.roll(files)
08:59:11     INFO - ..\python\mozlint\test\test_types.py:45:
08:59:11     INFO - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
08:59:11     INFO - ..\python\mozlint\mozlint\roller.py:157: in roll
08:59:11     INFO -     m.shutdown()
08:59:11     INFO - c:\mozilla-build\python\Lib\multiprocessing\util.py:207: in __call__
08:59:11     INFO -     res = self._callback(*self._args, **self._kwargs)
08:59:11     INFO - c:\mozilla-build\python\Lib\multiprocessing\managers.py:625: in _finalize_manager
08:59:11     INFO -     process.terminate()
08:59:11     INFO - c:\mozilla-build\python\Lib\multiprocessing\process.py:137: in terminate
08:59:11     INFO -     self._popen.terminate()
08:59:11     INFO - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
08:59:11     INFO - self = <multiprocessing.forking.Popen object at 0x032E53D0>
08:59:11     INFO -     def terminate(self):
08:59:11     INFO -         if self.returncode is None:
08:59:11     INFO -             try:
08:59:11     INFO - >               _subprocess.TerminateProcess(int(self._handle), TERMINATE)
08:59:11     INFO - E               WindowsError: [Error 5] Access is denied
08:59:11     INFO - c:\mozilla-build\python\Lib\multiprocessing\forking.py:312: WindowsError
08:59:11     INFO - ===================== 1 failed, 4 passed in 6.11 seconds ======================
08:59:11     INFO - Setting retcode to 1 from z:\task_1489479150\build\src\python\mozlint\test\test_types.py

I'm not sure if this is a different problem, or if we are generally hitting similar multiprocessing errors but not getting a backtrace. The multiprocessing library fits ahal's theory in #c49, at least.

In any case, is there an easy way to configure the star-able recommendations to a regex or something? It'd be nice if this bug only tracked those with this message:

Setting retcode to .*test_types.py

and:

Setting retcode to .*test_roller.py

The actual return code does seem to still vary a bit - in these two failures, the Linux one returned -139 and the Windows one returned 1.
Depends on: 1352158
Product: Core → Firefox Build System
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.