Closed
Bug 1308472
Opened 8 years ago
Closed 5 months ago
Intermittent 'make -k check' did not run successfully. Please check log for errors.
Categories
(Firefox Build System :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: aryx, Unassigned)
References
Details
(Keywords: intermittent-failure)
+++ This bug was initially created as a clone of Bug #1305227 +++
+++ This bug was initially created as a clone of Bug #1304593 +++
https://treeherder.mozilla.org/logviewer.html#?job_id=37243123&repo=mozilla-inbound
First occurrence seems to be https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=4531646
15:53:24 INFO - mozmake[3]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser/tools/mozscreenshots/mozscreenshots/extension'
15:53:24 INFO - mozmake[2]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser/tools/mozscreenshots'
15:53:24 INFO - mozmake[1]: Leaving directory 'z:/task_1475675460/build/src/obj-firefox/browser'
15:53:24 ERROR - Return code: 2
15:53:24 INFO - TinderboxPrint: check<br/>6167/0/0
15:53:24 WARNING - setting return code to 2
15:53:24 ERROR - 'make -k check' did not run successfully. Please check log for errors.
Comment 1•8 years ago
This looks like a make failure -- I don't see any previous errors in the logfile before mozmake returned 2.
Unless Rob or Pete recognize this as an issue with the new builders, I'd send this over to the build team?
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Comment 2•8 years ago
There have been no recent changes to the builders, but I note that the referenced log file is for a PGO build, which is something we've only been doing since bug 1304998. If that's where all the failures are, it may be a problem with the PGO configs.
Flags: needinfo?(rthijssen)
Comment 3•8 years ago
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1308472&tree=all
The OS X failure is mis-starred. But the rest are win32 pgo and win32 opt.
All I can find for ERRORs is:
08:36:59 ERROR - Return code: 1
09:34:05 ERROR - Return code: 2
09:34:05 ERROR - 'make -k check' did not run successfully. Please check log for errors.
So basically it looks like make is bombing out on its own, rather than reflecting an error from something it invoked. At any rate, this is beyond my ability to debug with this limited information.
Component: General → Build Config
Flags: needinfo?(pmoore)
Product: Taskcluster → Core
Comment 4•8 years ago
Looking at one of those logs:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=37243123#L40422
The log parser isn't finding some of the errors. Specifically, there's stuff like:
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-unittest=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-unittest-level=error
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-html=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-html-level=error
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-tbpl=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-tbpl-level=error
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-xunit=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-xunit-level=error
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-errorsummary=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-errorsummary-level=error
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-mach=-
12:15:12 INFO - usage: test.py [-h] [--log-raw LOG_RAW] [--log-raw-level LOG_RAW_LEVEL]
12:15:12 INFO - test.py: error: unrecognized arguments: --log-mach-level=error
12:15:12 INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpqpdska') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034503D0>> ignored
12:15:12 INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpro5zzh') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CADB0>> ignored
12:15:12 INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmp9jlr4s') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CA3D0>> ignored
12:15:12 INFO - Exception WindowsError: (32, 'The process cannot access the file because it is being used by another process', 'c:\\users\\task_1475834533\\appdata\\local\\temp\\tmpq86gzl') in <bound method NamedTemporaryFile.__del__ of <mozfile.mozfile.NamedTemporaryFile object at 0x034CA830>> ignored
<...>
12:15:24 INFO - z:/task_1475834533/build/src/testing/testsuite-targets.mk:340: recipe for target 'check' failed
12:15:24 INFO - mozmake: *** [check] Error 1
Some of this might be fallout from bug 1304593.
Comment 5•8 years ago
Or bug 1304593 didn't fix the entire intermittent, as "make -k check return code 2" was happening over there too. But these look like errors in the mozbase tests, so it could be differences in how they're run. Given this is intermittent, I'm guessing the "unrecognized arguments" errors are not the problem (though they should likely be fixed anyway). It's the WindowsError at the bottom that is causing the orange.
Comment hidden (Intermittent Failures Robot)
Comment 7•8 years ago
It has occurred to me that some of the race conditions and file locks we see in Windows builds (like this one) might be due to poor performance on the c: drive (because of the first-read EBS problems: bug 1305174). Often when this comes up, the file in question is a temp file on the c: drive, whereas the build working directory for TC Windows builds is on the faster ephemeral volumes (z:).
If this is the cause (or perhaps one of the causes), we could address it in a couple of different ways. We already have bug 1306989 against the generic-worker to point the temp environment variables at paths on z:, but it's also possible to make the change in-tree with something like this: https://hg.mozilla.org/try/diff/48a500082516/taskcluster/taskgraph/transforms/job/mozharness.py which just sets the variables and creates the temp folders for the task user before the build starts.
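A minimal sketch of what that in-tree change amounts to, assuming the transform hands us the worker's env dict; the function name, dict shape, and default path below are illustrative, not the actual patch on try:

# Hypothetical sketch of pointing the task's temp directories at the fast
# ephemeral volume (z:) instead of c:. The real transform in
# taskcluster/taskgraph/transforms/job/mozharness.py may wire this up differently.
def redirect_temp_to_ephemeral(worker_env, task_temp_dir='z:\\build\\temp'):
    # Windows consults both TEMP and TMP, so set them together.
    for var in ('TEMP', 'TMP'):
        worker_env[var] = task_temp_dir
    return worker_env

# The task user still needs the directory to exist before the build starts,
# e.g. via a pre-build step that creates z:\build\temp.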
Comment 8•8 years ago
Actually, here is a green windows log that has the same messages as those in comment 4:
https://archive.mozilla.org/pub/firefox/try-builds/ahalberstadt@mozilla.com-b5bd0b383510b9afb5120bb90e2e2017762577c3/try-win32/try-win32-bm75-try1-build1403.txt.gz
So I think that's unrelated.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 11•8 years ago
I've now seen several failure logs from jobs where temp data was not on the c: drive, so I no longer think my theory in comment 7 is worth exploring.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 14•8 years ago
:grenade, are you investigating this? 94 occurrences last week seems like a lot.
Flags: needinfo?(rthijssen)
Comment 15•8 years ago
I'm not looking at this. If it had been PGO only, it could have been explained by a problem with the PGO config, which I recently changed, but for a make failure we'll need build team expertise.
Flags: needinfo?(rthijssen)
Comment 16•8 years ago
this is an error setting up hg:
13:49:45 INFO - [mozharness: 2016-10-15 13:49:45.751000Z] Skipping clobber step.
13:49:45 INFO - [mozharness: 2016-10-15 13:49:45.751000Z] Running clone-tools step.
13:49:45 INFO - Running pre-action listener: influxdb_recording_pre_action
13:49:45 INFO - Running main action method: clone_tools
13:49:45 INFO - retry: Calling _get_revision with args: (<mozharness.base.vcs.mercurial.MercurialVCS object at 0x0315AEB0>, 'z:\\task_1476538045\\build\\tools'), kwargs: {}, attempt #1
13:49:45 WARNING - did not specify revision or branch; assuming "default"
13:49:45 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--version']
13:49:45 INFO - Copy/paste: hg --config ui.merge=internal:merge --version
13:49:45 INFO - Mercurial Distributed SCM (version 3.7.3)
13:49:45 INFO - (see https://mercurial-scm.org for more information)
13:49:45 INFO - Copyright (C) 2005-2016 Matt Mackall and others
13:49:45 INFO - This is free software; see the source for copying conditions. There is NO
13:49:45 INFO - warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
13:49:45 INFO - Return code: 0
13:49:45 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', 'debuginstall']
13:49:45 INFO - Copy/paste: hg --config ui.merge=internal:merge debuginstall
13:49:46 INFO - checking encoding (cp1252)...
13:49:46 INFO - checking Python executable (C:\mozilla-build\python\Scripts\hg.exe)
13:49:46 INFO - checking Python version (2.7.11)
13:49:46 INFO - checking Python lib (C:\mozilla-build\python\lib)...
13:49:46 INFO - checking installed modules (C:\mozilla-build\python\Lib\site-packages\mercurial)...
13:49:46 INFO - checking templates (C:\mozilla-build\python\Lib\site-packages\mercurial\templates)...
13:49:46 INFO - checking commit editor...
13:49:46 INFO - No commit editor set and can't find vi in PATH
13:49:46 INFO - (specify a commit editor in your configuration file)
13:49:46 INFO - checking username...
13:49:46 INFO - no username supplied
13:49:46 INFO - (specify a username in your configuration file)
13:49:46 INFO - 1 problems detected, please check your install!
13:49:46 ERROR - Return code: 1
13:49:46 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1476538045\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/build/tools', 'z:\\task_1476538045\\build\\tools', '--sharebase', 'C:\\builds\\hg-shared', '--branch', 'default']
13:49:46 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1476538045\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/build/tools z:\task_1476538045\build\tools --sharebase C:\builds\hg-shared --branch default
13:49:55 INFO -
13:49:55 INFO - clone [=============================> ] 18057764/28698371 02s
13:49:55 INFO - clone [=========================================> ] 25260729/28698371 01s
13:49:55 INFO -
13:49:55 INFO - ensuring https://hg.mozilla.org/build/tools@default is available at z:\task_1476538045\build\tools
13:49:55 INFO - (sharing from new pooled repository 7ae7fb134bf7aec6ec96a062ff47a69053dd2973)
13:49:55 INFO - applying clone bundle from https://s3-us-west-1.amazonaws.com/moz-hg-bundles-us-west-1/build/tools/78748858da198ca43c54765522a91ecebee0e075.packed1.hg
13:49:55 INFO - 2021 files to transfer, 27.4 MB of data
13:49:55 INFO - transferred 27.4 MB in 4.1 seconds (6.66 MB/sec)
13:49:55 INFO - finished applying clone bundle
13:49:55 INFO - searching for changes
13:49:55 INFO - no changes found
13:49:55 INFO - searching for changes
13:49:55 INFO - no changes found
13:49:55 INFO - (pulling to obtain default)
13:49:55 INFO - (remote resolved default to 78748858da198ca43c54765522a91ecebee0e075; result is not deterministic)
13:49:55 INFO - (revision already present locally; not pulling)
David, can you find a build peer to look at this? I suspect it isn't build related but taskcluster related given the fact that this only fails on taskcluster and not on buildbot.
Flags: needinfo?(dburns)
Comment 17•8 years ago
Greg, since you have been working on TC lately and your hg knowledge is great, can you have a quick look?
Flags: needinfo?(dburns) → needinfo?(gps)
Comment 18•8 years ago
I had no idea that the make error was hg related, but if it is, and given the hg 3.7.3 invocation in the log in comment 16, I'd like to add that there are two versions of hg on the TC builders:
- 3.7.3, installed with mozilla-build at c:\mozilla-build\python\scripts\hg.exe
- 3.9.1, installed with the Mercurial Inno Setup installer at c:\Program Files\Mercurial\hg.exe
Currently the older version takes precedence in the PATH environment variable and will be used whenever no explicit path to hg is given in the hg command.
Comment 19•8 years ago
I don't think the hg error is related -- that has always occurred, and is only an issue when creating a commit. But Greg will know more.
Comment 20•8 years ago
The exceptions from comment #4 seem like the real intermittent failure. Possibly a poorly written test. I'm not sure what file/test they are occurring in, however.
Comment 21•8 years ago
Those are also red herrings; that sequence of errors shows up in every log (see comment 8 for example).
Comment 22•8 years ago
I am unclear on the next steps to resolve this. It appears I found an error in the logs which isn't really an error (not sure why we don't fix that?), but the real failure is in something called test.py, which as far as I can tell runs the mozbase unit tests. Why are these part of the build?
it appears test.py is here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/test.py
we run a series of tests defined here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/test-manifest.ini
and it seems that we make it to the end of the tests (test_apk.ApkTest.test_with_package_name).
all of this is run from the command:
z:/task_1475834533/build/src/obj-firefox\_virtualenv\Scripts\python.exe z:/task_1475834533/build/src/testing/mozbase/test.py
:ahal, I see you reviewed a patch for pylint which touched a lot of mozbase code recently (possibly it landed a few days prior to this issue showing up):
https://hg.mozilla.org/mozilla-central/rev/f7b6a16eb1f6
Flags: needinfo?(ahalberstadt)
Comment 23•8 years ago
No, the mozbase linting landed well after this was filed. I could be wrong, but I don't think this is related to test.py. What makes you say that's where the real error is?
Flags: needinfo?(ahalberstadt)
Comment 24•8 years ago
Comment 4 seems to be what Ted and gps think is the root cause, which is test.py.
This bug is a bit silly since so many people have looked at the failure logs and cannot determine the root cause of the failure.
Looking at what is failing, this is TaskCluster Windows builds only, not Buildbot. If we cannot figure out the root cause, that is the place to start looking.
Comment 25•8 years ago
As I've said before, comment 4 happens in every single job, even green ones. Here's an arbitrary successful windows taskcluster build:
https://treeherder.mozilla.org/logviewer.html#?job_id=3656791&repo=autoland#L0-L34278
Ctrl-F for the error in comment 4 and you'll find it :). I believe the true cause of the error is not showing up in the logs.
Comment 26•8 years ago
Got it, I thought that was in reference to the hg error I found.
"We cannot find the root cause from the logs" seems to be a valid summary of where things stand.
Since this is build related, I would like :automatedtester to help find a build peer to track down the root cause.
Flags: needinfo?(dburns)
Comment 27•8 years ago
Clearing the needinfo since there is already one on a build peer.
Flags: needinfo?(dburns)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 31•8 years ago
There are now OS X and Linux failures too!
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1479887795/mozilla-inbound-macosx64-debug-bm86-build1-build6103.txt.gz
01:37:13 INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/python/mozbuild/mozbuild/test/configure/test_toolchain_configure.py | WindowsToolchainTest.test_unsupported_msvc
01:37:43 INFO - /builds/slave/m-in-m64-d-0000000000000000000/build/src/build/compare-mozconfig/compare-mozconfigs-wrapper.py
01:37:43 INFO - WARNING:__main__:if test `uname -s` != Linux; then not in macosx-universal [u'ac_add_options --enable-update-channel=nightly', u'ac_add_options --with-branding=browser/branding/nightly', u'ac_add_options --enable-profiling', u'mk_add_options CLIENT_PY_ARGS="--hg-options=\'--verbose --time\' --hgtool=../tools/buildfarm/utils/hgtool.py --skip-chatzilla --skip-comm --skip-inspector --tinderbox-print"', u'mk_add_options MOZ_MAKE_FLAGS="-j4"', u'if test "${MOZ_UPDATE_CHANNEL}" = "nightly"; then', u'ac_add_options --with-macbundlename-prefix=Firefox', u'fi', u'mk_add_options MOZ_MAKE_FLAGS="-j12"', u'ac_add_options --with-ccache', u'. "$topsrcdir/build/mozconfig.cache"', u'ac_add_options --disable-install-strip', u'ac_add_options --enable-instruments', u'ac_add_options --enable-dtrace']!
01:37:43 INFO - ERROR:__main__:found in /builds/slave/m-in-m64-d-0000000000000000000/build/src/browser/config/mozconfigs/macosx-universal/nightly but not in /builds/slave/m-in-m64-d-0000000000000000000/build/src/browser/config/mozconfigs/macosx-universal/beta: if test `uname -s` != Linux; then
01:37:43 INFO - ERROR:root:Mozconfig check failed!
01:37:43 INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/build/compare-mozconfig/compare-mozconfigs-wrapper.py | TestCompareMozconfigs.test_compare_mozconfigs
01:37:43 INFO - /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py
01:37:43 INFO - WARNING: Not a supported OS_TARGET for NSPR in moz.build: "". Use --with-system-nspr
01:37:43 INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_filesystem_traversal_no_config
01:37:43 INFO - TEST-SKIP | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_filesystem_traversal_reading
01:37:43 INFO - TEST-PASS | /builds/slave/m-in-m64-d-0000000000000000000/build/src/config/tests/test_mozbuild_reading.py | TestMozbuildReading.test_orphan_file_patterns
01:37:43 INFO - make: *** [check] Error 245
Comment 32•8 years ago
And on linux,
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=6952708#L20535
19:21:02 INFO - PROCESS-CRASH | test_assert.js | application crashed [@ xpcshell + 0x4b2a]
Comment 33•8 years ago
:mshal - The failure in comment 31 looks like a regression from bug 1183613; can you have a look?
Flags: needinfo?(mshal)
Comment hidden (Intermittent Failures Robot)
Comment 35•8 years ago
It looks like there are a few things going on here.
1) The compare-mozconfigs check fails
- I introduced the OSX one in bug 1183613 as :gbrown noted
- The Linux one is from bug 1278445 (back in July)
2) 'make check' still returns success even when compare-mozconfigs fails, which is why the errors above went unnoticed.
- This was partly fixed I think by bug 1255479, but it missed one of the failure codepaths. The 'return ret_code' in compare-mozconfigs-wrapper.py should be an assert.
3) 'make -k check' randomly fails with strange error codes for no discernible reason (aka: this bug and bug 1318597). I still don't know the reason for this, nor can I reproduce it locally.
I'll file separate bugs for 1) & 2) and fix those, but I don't think that alone will fix this issue.
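As a rough illustration of the fix for 2), assuming the wrapper currently just returns the subprocess's exit status; the function below is a simplified stand-in for compare-mozconfigs-wrapper.py, not its actual code:

# Hypothetical sketch of turning the silent 'return ret_code' path into an
# assertion so 'make check' actually fails when the comparison fails.
import subprocess
import sys

def run_compare(compare_script, topsrcdir):
    ret_code = subprocess.call([sys.executable, compare_script, topsrcdir])
    # Previously the status was just returned and could be swallowed by the
    # caller; asserting surfaces the failure to the unittest harness.
    assert ret_code == 0, 'compare-mozconfigs exited with %d' % ret_code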
Flags: needinfo?(mshal)
Updated•8 years ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 39•8 years ago
It has been a few weeks and this is still continuing. :mshal, can you have a fresh look at this?
Flags: needinfo?(mshal)
Comment 40•8 years ago
Sure, I'll take another look. I haven't had much luck reproducing it locally yet, but I'm still trying. If not, the next best thing may be to just add some more debugging to see if we can get an actual error message to pinpoint the problem.
Flags: needinfo?(mshal)
Comment 41•8 years ago
I still haven't had any success in reproducing this locally. I'll file a bug to add some debugging, and hopefully on the next report we'll have some logs that give additional clues.
Comment 42•8 years ago
I did finally hit this locally, though it took a couple hundred runs. I had the patch in bug 1325197 applied, and saw this:
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-2.9.2, py-1.4.31, pluggy-0.3.1 -- /home/marf/mozilla-central-git/obj-x86_64-pc-linux-gnu/_virtualenv/bin/python
cachedir: ../python/mozlint/.cache
rootdir: /home/marf/mozilla-central-git/python/mozlint, inifile:
collecting ... collected 5 items
../python/mozlint/test/test_types.py::test_linter_types[string.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[regex.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[external.lint] PASSED
../python/mozlint/test/test_types.py::test_linter_types[structured.lint] PASSED
../python/mozlint/test/test_types.py::test_no_filter PASSED
=========================== 5 passed in 0.16 seconds ===========================
Setting retcode to -139 from /home/marf/mozilla-central-git/python/mozlint/test/test_types.py
Which looks like test_types.py ran successfully, but then somehow the test harness gets a return code of -139. I was able to run test_types.py several thousand times successfully in isolation, so I'm not certain that this specific test is to blame. Maybe there's some funky interaction with running these in parallel.
In any case, if I hit it again I'll post more info.
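For anyone else decoding these numbers: Python's subprocess layer reports a child killed by signal N as returncode -N, while a shell reports the same death as 128 + N, so 139 corresponds to SIGSEGV (signal 11); presumably the -139 above is that shell-style code getting negated somewhere in the harness. A small Linux/OS X sketch, purely illustrative:

# Sketch: mapping signal-death exit codes back to signal names.
import signal
import subprocess
import sys

# A child that deliberately kills itself with SIGSEGV, just for illustration.
proc = subprocess.Popen(
    [sys.executable, '-c', 'import os, signal; os.kill(os.getpid(), signal.SIGSEGV)'])
proc.wait()

sig_names = dict((getattr(signal, name), name) for name in dir(signal)
                 if name.startswith('SIG') and not name.startswith('SIG_'))
if proc.returncode < 0:
    sig = -proc.returncode
    print('child killed by signal %d (%s), shell-style code %d'
          % (sig, sig_names.get(sig, 'unknown'), 128 + sig))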
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 45•8 years ago
Locally I've now hit this both in test_types.py and test_roller.py, which seems to match the logs since bug 1325197 landed. Specifically, test_roller.py failed in 7 cases and test_types.py failed in 13 cases.
I'm not sure if it's relevant, but these are 2 (out of 4) files that use pytest (instead of mozunit). In all cases the file that ends up setting the bad return code reports that all tests ran successfully, as in #c42.
Comment 46•8 years ago
On unix systems the return code comes from mozprocess here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#609
I think there are several issues rolled up into this bug, because clicking through a bunch of the Windows failures in OrangeFactor shows that make check fails before the python tests even get a chance to run. We should probably turn those log lines from bug 1325197 into errors when the return code is non-zero; then we could file a new bug and star those failures over there.
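Something like this minimal sketch, assuming the check harness has the per-file return code in hand when it prints that line ('report_retcode' and the logger name are made-up here, not the actual bug 1325197 patch):

# Hypothetical: emit the retcode line at ERROR so the log parser can star it,
# and keep it at INFO when the file passed.
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')
log = logging.getLogger('make-check')

def report_retcode(test_file, retcode):
    level = logging.ERROR if retcode != 0 else logging.INFO
    log.log(level, 'Setting retcode to %d from %s', retcode, test_file)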
Comment 47•8 years ago
Yeah, unfortunately the error message in this bug catches anything that causes 'make -k check' to fail. The ones I'm looking at specifically are the ones where we got no error message aside from make reporting the error, which now look to be specific to test_roller.py and test_types.py. Bug 1318597 catches these a lot better, but is specific to OSX since the error codes differ on other platforms.
I've been adding some more debug to try to figure out where the error code goes from 0 to something non-zero (since the test cases look like they pass). I also made a wrapper for the virtualenv python so I could see what the python subprocess itself was returning.
In one instance where I hit a test_types.py failure, it showed the return code as 0 right before calling sys.exit(rc). The python wrapper shell script then reports that the python process segfaults, and has a return value of 139 (on Linux - this varies per platform). So it looks like we're doing something funky in these two tests that triggers a rare python bug during cleanup.
ahal, any idea what test_types.py / test_roller.py would be doing differently than the other two mozlint tests? I still haven't seen a failure in test_formatters.py or test_parser.py
Flags: needinfo?(ahalberstadt)
Comment hidden (Intermittent Failures Robot)
Comment 49•8 years ago
Huh, that is bizarre. But I think your assessment might be correct. Both test_types.py and test_roller.py create a LintRoller() instance, whereas test_formatters.py and test_parser.py don't.
The LintRoller() object in turn uses the multiprocessing stdlib, so that must be related. I'll put aside some time this week to dig into this further.
Flags: needinfo?(ahalberstadt)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 52•8 years ago
First, for some reason the ERROR log isn't showing up in treeherder as an error:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=69175359&lineNumber=16851
If we can get treeherder to highlight that line, it will be a lot easier to file a bug that can catch/isolate this error from the other stuff.
Second, I suspect that the multiprocessing manager is somehow involved:
https://dxr.mozilla.org/mozilla-central/source/python/mozlint/mozlint/roller.py#127
Mostly because we are segfaulting, and the manager is responsible for handling shared memory access. The only thing I can think to try is to explicitly call manager.shutdown() at the end (it is supposed to shut down when gc'ed, though). I plan to move the mozlint tests into their own separate task once bug 1331663 is completed. I'm tempted to wait until that happens to make testing this fix on try easier (no need to rebuild hundreds of times).
Alternatively, we could attempt to push that "fix" blindly and see if it works.
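For illustration, a minimal sketch of the explicit-shutdown idea; the class below is a simplified stand-in for LintRoller, not the actual roller.py code:

# Hypothetical sketch: tear down the multiprocessing Manager explicitly
# instead of relying on garbage collection during interpreter teardown,
# which is the suspected source of the intermittent segfault/WindowsError.
from multiprocessing import Manager

class RollerSketch(object):
    def __init__(self):
        self._manager = Manager()
        self.results = self._manager.dict()

    def roll(self, paths):
        try:
            for path in paths:
                self.results[path] = []   # stand-in for running linters
            return dict(self.results)
        finally:
            # Explicit shutdown terminates the manager's server process
            # deterministically.
            self._manager.shutdown()

if __name__ == '__main__':   # guard required for multiprocessing on Windows
    print(RollerSketch().roll(['foo.py', 'bar.js']))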
Flags: needinfo?(gps)
Comment 53•8 years ago
Where's the treeherder logic that determines what is an error? I'm totally fine with just changing that message to match a filter better.
Though note that you'll see that error message if any of the python tests fail, so you'd have to specifically look for test_roller.py / test_types.py in order to better organize the bug reporting. The "Error 245" one seems to get the OSX ones very consistently at least.
Comment 54•8 years ago
(In reply to Michael Shal [:mshal] from comment #53)
> Where's the treeherder logic that determines what is an error? I'm totally
> find with just changing that message to match a filter better.
https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/parsers.py
Comment 55•8 years ago
I think the bug is that we *are* logging that at the logging.ERROR level, but by the time it gets to the log, it's only an INFO. There may be other error messages scattered around the tree being hidden this way too.
Eventually I'd also like to use mozlog there instead of mach's logging mechanism, but I have no timetable for that yet.
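To make that failure mode concrete, here is a contrived example (not mach's actual logging code) of how an ERROR-level record can end up in the log as INFO when it is re-emitted through a forwarding handler at a fixed level:

# Contrived illustration of an ERROR record being downgraded on re-emit.
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')
structured = logging.getLogger('structured-output')

class ForwardingHandler(logging.Handler):
    def emit(self, record):
        # The original record.levelno (ERROR) is dropped here.
        structured.info(self.format(record))

app = logging.getLogger('mozlint')
app.propagate = False
app.addHandler(ForwardingHandler())
app.error('Mozconfig check failed!')  # printed as: INFO - Mozconfig check failed!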
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 63•8 years ago
I looked at the 13 failures in #c62 to see if there has been any progress. We may want to narrow this bug to specifically track the test_types.py and test_roller.py failures that appear to succeed and then return failure. We get lots of false positives just by looking for "'make -k check' did not run successfully":
- 8 instances of bug 1347414
- 1 instance of bug 1349943
- 1 seems to be an infra failure: make[2]: *** .deps/udata.o.pp: Input/output error. Stop.
- 1 appears to be mis-starred (it's a reftest failure, not a make check failure)
The two remaining are the failures that I think prompted this bug, namely that test_roller.py and/or test_types.py fail without any indication why. One of these was a linux64-asan build, and the other was windows2012-32-debug. The latter was more interesting, since it actually managed to produce some manner of stack trace:
08:59:11 INFO - ..\python\mozlint\test\test_types.py::test_linter_types[string.lint] PASSED
08:59:11 INFO - ..\python\mozlint\test\test_types.py::test_linter_types[regex.lint] PASSED
08:59:11 INFO - ..\python\mozlint\test\test_types.py::test_linter_types[external.lint] PASSED
08:59:11 INFO - ..\python\mozlint\test\test_types.py::test_linter_types[structured.lint] PASSED
08:59:11 INFO - ..\python\mozlint\test\test_types.py::test_no_filter FAILED
08:59:11 INFO - ================================== FAILURES ===================================
08:59:11 INFO - _______________________________ test_no_filter ________________________________
08:59:11 INFO - lint = <mozlint.roller.LintRoller object at 0x032E5150>
08:59:11 INFO - lintdir = 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\linters'
08:59:11 INFO - files = ['z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\foobar.js', 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\foobar.py', 'z:\\task_1489479150\\build\\src\\python\\mozlint\\test\\files\\no_foobar.js']
08:59:11 INFO - def test_no_filter(lint, lintdir, files):
08:59:11 INFO - lint.read(os.path.join(lintdir, 'explicit_path.lint'))
08:59:11 INFO - result = lint.roll(files)
08:59:11 INFO - assert len(result) == 0
08:59:11 INFO - lint.lintargs['use_filters'] = False
08:59:11 INFO - > result = lint.roll(files)
08:59:11 INFO - ..\python\mozlint\test\test_types.py:45:
08:59:11 INFO - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
08:59:11 INFO - ..\python\mozlint\mozlint\roller.py:157: in roll
08:59:11 INFO - m.shutdown()
08:59:11 INFO - c:\mozilla-build\python\Lib\multiprocessing\util.py:207: in __call__
08:59:11 INFO - res = self._callback(*self._args, **self._kwargs)
08:59:11 INFO - c:\mozilla-build\python\Lib\multiprocessing\managers.py:625: in _finalize_manager
08:59:11 INFO - process.terminate()
08:59:11 INFO - c:\mozilla-build\python\Lib\multiprocessing\process.py:137: in terminate
08:59:11 INFO - self._popen.terminate()
08:59:11 INFO - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
08:59:11 INFO - self = <multiprocessing.forking.Popen object at 0x032E53D0>
08:59:11 INFO - def terminate(self):
08:59:11 INFO - if self.returncode is None:
08:59:11 INFO - try:
08:59:11 INFO - > _subprocess.TerminateProcess(int(self._handle), TERMINATE)
08:59:11 INFO - E WindowsError: [Error 5] Access is denied
08:59:11 INFO - c:\mozilla-build\python\Lib\multiprocessing\forking.py:312: WindowsError
08:59:11 INFO - ===================== 1 failed, 4 passed in 6.11 seconds ======================
08:59:11 INFO - Setting retcode to 1 from z:\task_1489479150\build\src\python\mozlint\test\test_types.py
I'm not sure if this is a different problem, or if we are generally hitting similar multiprocessing errors but not getting a backtrace. The multiprocessing library fits ahal's theory in #c49, at least.
In any case, is there an easy way to configure the star-able recommendations to a regex or something? It'd be nice if this bug only tracked those with this message:
Setting retcode to .*test_types.py
and:
Setting retcode to .*test_roller.py
The actual return code does seem to still vary a bit - in these two failures, the Linux one returned -139 and the Windows one returned 1.
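To illustrate what those filters would match, a quick sketch; the scanning function is hypothetical and is not how Treeherder/OrangeFactor suggestion matching is actually configured:

# Hypothetical filter for the two failure signatures proposed above.
import re

PATTERNS = [
    re.compile(r'Setting retcode to .*test_types\.py'),
    re.compile(r'Setting retcode to .*test_roller\.py'),
]

def matches_this_bug(log_lines):
    return any(p.search(line) for line in log_lines for p in PATTERNS)

# Example:
sample = ['Setting retcode to -139 from /src/python/mozlint/test/test_types.py']
print(matches_this_bug(sample))  # True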
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Product: Core → Firefox Build System
Updated•2 years ago
Severity: normal → S3
Comment 71•5 months ago
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
For more information, please visit BugBot documentation.
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → INCOMPLETE