Closed Bug 1436237 Opened 7 years ago Closed 6 years ago

Intermittent reftest Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1517946424\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1517946424\\build\\tests\\reftest\\runreftest.py'

Categories

(Testing :: Reftest, defect, P5)

Version 3
defect

Tracking

(firefox-esr60 fixed, firefox64 fixed, firefox65 fixed)

RESOLVED FIXED
Tracking Status
firefox-esr60 --- fixed
firefox64 --- fixed
firefox65 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])

Filed by: rgurzau [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=160656014&repo=mozilla-inbound
https://queue.taskcluster.net/v1/task/Og_9HnMORqmzoy7tmHD6Eg/runs/0/artifacts/public/logs/live_backing.log
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/Og_9HnMORqmzoy7tmHD6Eg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1

20:06:49 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1517946424\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1517946424\\build\\tests\\reftest\\runreftest.py', '--appname=Z:\\task_1517946424\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/TH62S7GNTVKzQzjapqX8zg/artifacts/public/build/target.crashreporter-symbols.zip', '--log-raw=Z:\\task_1517946424\\build\\blobber_upload_dir\\crashtest_raw.log', '--log-errorsummary=Z:\\task_1517946424\\build\\blobber_upload_dir\\crashtest_errorsummary.log', '--cleanup-crashes', '--marionette-startup-timeout=180', '--suite=crashtest', '--', 'tests/reftest/tests/testing/crashtest/crashtests.list']
[taskcluster 2018-02-06T20:47:07.613Z] Aborting task - max run time exceeded!
In the last 7 days, there have been 35 failures. Most of the failures are on windows10-64, and a few of them on windows10-64-ccov. Affected build types: debug, opt and pgo.

An example of a recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164292055&lineNumber=1778

And the relevant part of the log:

11:18:54 INFO - ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\genericworker\appdata\local\temp\tmpweg8qb.mozrunner\runreftest_leaks_tab_pid7796.log
11:18:54 INFO - [Child 7796, Main Thread] WARNING: No CID found when attempting to map contract ID: file z:/build/build/src/xpcom/components/nsComponentManager.cpp, line 508
11:18:59 INFO - --DOCSHELL 000001AD50760000 == 7 [pid = 1692] [id = {bf19be86-71be-4ee6-ba58-3e0737a37be4}]
11:18:59 INFO - --DOCSHELL 000001AD4CEC7000 == 6 [pid = 1692] [id = {76212ae8-d9a7-4e5c-a735-af8b48184559}]
11:18:59 INFO - --DOCSHELL 000001AD4FAE8000 == 5 [pid = 1692] [id = {61fd81ea-e623-4d94-a990-c3c345c00738}]
11:18:59 INFO - --DOCSHELL 000001AD52803000 == 4 [pid = 1692] [id = {2909ac40-b58a-4b02-870a-0ad7492dee21}]
11:19:00 INFO - --DOMWINDOW == 18 (000001AD50175C00) [pid = 1692] [serial = 8] [outer = 0000000000000000] [url = about:blank]
11:19:01 INFO - --DOCSHELL 0000021E8043F800 == 1 [pid = 5568] [id = {aeda12ed-9386-4d75-9417-332f2f52f82c}]
11:19:04 INFO - --DOMWINDOW == 17 (000001AD4D1E8400) [pid = 1692] [serial = 4] [outer = 0000000000000000] [url = about:blank]
11:19:04 INFO - --DOMWINDOW == 16 (000001AD52FA8000) [pid = 1692] [serial = 11] [outer = 0000000000000000] [url = about:blank]
11:19:04 INFO - --DOMWINDOW == 15 (000001AD4C7B4C00) [pid = 1692] [serial = 13] [outer = 0000000000000000] [url = about:blank]
11:19:04 INFO - --DOMWINDOW == 14 (000001AD4C7B6400) [pid = 1692] [serial = 14] [outer = 0000000000000000] [url = about:blank]
11:19:06 INFO - --DOMWINDOW == 5 (0000021E87F2C000) [pid = 5568] [serial = 5] [outer = 0000000000000000] [url = about:blank]
11:19:06 INFO - --DOMWINDOW == 4 (0000021E87B17C00) [pid = 5568] [serial = 2] [outer = 0000000000000000] [url = about:blank]
11:19:07 INFO - --DOMWINDOW == 13 (000001AD50972400) [pid = 1692] [serial = 9] [outer = 0000000000000000] [url = about:blank]
11:19:09 INFO - --DOMWINDOW == 3 (0000021E87B17800) [pid = 5568] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
11:19:11 INFO - --DOMWINDOW == 2 (0000027E741F4400) [pid = 3572] [serial = 2] [outer = 0000000000000000] [url = about:blank]
11:19:11 INFO - --DOMWINDOW == 2 (0000021E87E92000) [pid = 5568] [serial = 3] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
11:19:13 INFO - --DOMWINDOW == 12 (000001AD5251A800) [pid = 1692] [serial = 10] [outer = 0000000000000000] [url = about:blank]
11:35:53 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1519643815\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1519643815\\build\\tests\\reftest\\runreftest.py', '--appname=Z:\\task_1519643815\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=Z:\\task_1519643815\\build\\symbols', '--log-raw=Z:\\task_1519643815\\build\\blobber_upload_dir\\crashtest_raw.log', '--log-errorsummary=Z:\\task_1519643815\\build\\blobber_upload_dir\\crashtest_errorsummary.log', '--cleanup-crashes', '--marionette-startup-timeout=180', '--suite=crashtest', '--', 'tests/reftest/tests/testing/crashtest/crashtests.list']
[taskcluster 2018-02-26T12:16:57.035Z] Exit Code: 0
[taskcluster 2018-02-26T12:16:57.035Z] User Time: 0s
[taskcluster 2018-02-26T12:16:57.035Z] Kernel Time: 0s
[taskcluster 2018-02-26T12:16:57.035Z] Wall Time: 59m54.3194423s
[taskcluster 2018-02-26T12:16:57.035Z] Peak Memory: 4894720
[taskcluster 2018-02-26T12:16:57.035Z] Result: IDLENESS_LIMIT_EXCEEDED
[taskcluster 2018-02-26T12:16:57.036Z] === Task Finished ===
[taskcluster 2018-02-26T12:16:57.036Z] Task Duration: 59m54.3239426s

:ahal As you are the triage owner of this component, could you please take a look at this? Thank you!
Flags: needinfo?(ahalberstadt)
Whiteboard: [stockwell needswork]
:gbrown / :whimboo - this looks like Windows 10 reftests starting the browser and never starting the tests. I'm not sure why we don't hit the 370 second timeout - maybe Win10 has other issues and there is a bug in mozprocess?
Flags: needinfo?(hskupin)
Flags: needinfo?(gbrown)
Flags: needinfo?(ahalberstadt)
I think the key question here is why we don't hit the 370 second timeout. If we hit that timeout, we'd get a crash report and screenshots and have a chance of finding a root cause. I'll do a quick check that we are setting a no-output timeout, but I don't have many cycles for diving in here right now.
Flags: needinfo?(gbrown)
Using Windows 10 debug crashtests, I verified that mozprocess is using a 370 second output timeout when running firefox. I don't know what's going wrong.
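For reference, here is a minimal sketch (an assumed illustration, not the actual runreftest/mozharness code) of how mozprocess's overall timeout differs from its no-output timeout; the 1000 second and 370 second values are the ones discussed in this bug:

from mozprocess import ProcessHandler

def on_timeout():
    # Called when the process produced no output for outputTimeout seconds,
    # or when the overall timeout was exceeded.
    print("Automation Error: mozprocess timed out")

# Hypothetical command; the real harness runs runreftest.py here.
proc = ProcessHandler(["python", "-u", "runreftest.py", "--suite=crashtest"],
                      processOutputLine=[print],  # stream output as it arrives
                      onTimeout=[on_timeout])
proc.run(timeout=1000,        # hard cap on total runtime (seconds)
         outputTimeout=370)   # trigger if the process stays silent this long
status = proc.wait()
print("exit status:", status)

If the 370 second output timeout fired as expected, the harness would get a chance to collect a crash report and screenshots before the outer 1000 second limit kills everything.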
Do reftests and crashtests set the mozcrashreporter environment variable to shut down Firefox in case of content crashes? This looks somewhat like we are seeing a crash of the tab which drives the tests.
Flags: needinfo?(hskupin)
Oh, and we should really take a screenshot when such a situation happens. I think that is what we do with mochitests, and that helped us with the above situation.
(In reply to Geoff Brown [:gbrown] from comment #9)
> Using Windows 10 debug crashtests, I verified that mozprocess is using a 370
> second output timeout when running firefox. I don't know what's going wrong.

We may hang infinitely due to bug 1421289? If there is a crash and Firefox would quit itself, the result might be different.
Summary: Intermittent Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1517946424\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1517946424\\build\\tests\\reftest\\runreftest.py' → Intermittent reftest Automation Error: mozprocess timed out after 1000 seconds running ['Z:\\task_1517946424\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1517946424\\build\\tests\\reftest\\runreftest.py'
Sebastian, what can we do to make sure that this bug appears in the suggested bugs on Treeherder for such reftest failures?

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=071ee904485e21e19ca08456d32bce6825b77a26&selectedJob=171995989
Flags: needinfo?(aryx.bugmail)
The failure line in the log is:

01:09:35 INFO - Automation Error: mozprocess timed out after 1000 seconds running ['C:\\Users\\task_1522889138\\build\\venv\\Scripts\\python', '-u', 'C:\\Users\\task_1522889138\\build\\tests\\reftest\\runreftest.py', '--total-chunks', '2', '--this-chunk', '2', '--appname=C:\\Users\\task_1522889138\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/A7Uz_VKZS1qllD9i8-I1GQ/artifacts/public/build/target.crashreporter-symbols.zip', '--log-raw=C:\\Users\\task_1522889138\\build\\blobber_upload_dir\\reftest_raw.log', '--log-errorsummary=C:\\Users\\task_1522889138\\build\\blobber_upload_dir\\reftest_errorsummary.log', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=C:\\Users\\task_1522889138\\build', '--suite=reftest', '--', 'tests/reftest/tests/layout/reftests/reftest.list']

A few issues:
* The error line contains parts that will always be unique (eg "task_1522889138" in the path), so it never gets correct suggestions. The harness should only be outputting relative paths in error messages.
* It's a ridiculously long error line (albeit bug 1444831 is even worse). Changing to relative paths throughout would help shorten it. The error suggestions logic trims the line to 100 characters, so the rest never gets seen anyway (https://github.com/mozilla/treeherder/blob/856e96cf3f32008f2fd6197c993a7736fafcee82/treeherder/model/error_summary.py#L132-L140).
* Ideally the mozharness error prefix for this line would be ERROR, not INFO (though this doesn't affect the log parser, since it specifically checks for "Automation Error:"; still, we want to move away from hardcoding every error string variation in Treeherder).
Flags: needinfo?(emorley)
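To illustrate why the suggestions never match: the task-specific token sits inside the trimmed prefix, so two occurrences of the same failure produce different search keys. A rough Python illustration only, assuming the 100-character trim linked above; this is not Treeherder's actual implementation.

import re

def search_key(error_line, max_len=100):
    # Rough illustration (not Treeherder's real code): only a trimmed prefix
    # of the error line is used, so anything unique inside it poisons matching.
    return error_line[:max_len]

line_a = r"Automation Error: mozprocess timed out after 1000 seconds running ['C:\Users\task_1522889138\build\venv\Scripts\python', '-u', ...]"
line_b = r"Automation Error: mozprocess timed out after 1000 seconds running ['C:\Users\task_1523587360\build\venv\Scripts\python', '-u', ...]"

print(search_key(line_a) == search_key(line_b))  # False: the task ids differ

def normalize(error_line):
    # Assumed fix for illustration only: mask run-specific tokens before matching.
    return re.sub(r"task_\d+", "task_<id>", error_line)

print(search_key(normalize(line_a)) == search_key(normalize(line_b)))  # True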
(In reply to Ed Morley [:emorley] from comment #24)
> * The error line contains parts that will always be unique (eg
> "task_1522889138" in the path), so never get correct suggestions. The
> harness should be only outputting relative paths in error messages.

Those lines are not from the individual harnesses but from mozharness. I'm not sure why we log those with the absolute path and not the relative one. Maybe to see which exact binaries are used? If we could agree to have relative paths output by mozharness, we could indeed make things better for Treeherder.

Chris, what do you think?

> * Ideally the mozharness error prefix for this line would be ERROR not INFO
> (though this doesn't affect the log parser, since it specifically checks for
> "Automation Error:", though we want to move away from hardcoding every error
> string variation in Treeherder)

Would it help if we made the harness name part of the error line, so we at least have an idea which harness the automation error is about? It could be something like: "Automation Error: mozprocess timed out after 1000 seconds running reftest: %command%".
Flags: needinfo?(catlee)
(In reply to Henrik Skupin (:whimboo) from comment #25)
> (In reply to Ed Morley [:emorley] from comment #24)
> > * The error line contains parts that will always be unique (eg
> > "task_1522889138" in the path), so never get correct suggestions. The
> > harness should be only outputting relative paths in error messages.
>
> Those lines are not from the individual harnesses but from mozharness. I'm
> not sure why we log those with the absolute path and not the relative one.
> Maybe to see which exact binaries are used? If we could agree to have
> relative paths as output by mozharness, we could indeed make things better
> for Treeherder.
>
> Chris, what do you think?

Without looking at the code, I would assume that we're logging the full command line that was run, rather than making an explicit decision about whether to log absolute or relative paths. If we can switch mozharness to use relative paths for the command line, then I'm sure the logging will be improved.

> > * Ideally the mozharness error prefix for this line would be ERROR not INFO
> > (though this doesn't affect the log parser, since it specifically checks for
> > "Automation Error:", though we want to move away from hardcoding every error
> > string variation in Treeherder)
>
> Would it help if we make the harness name part of the error line so we have
> at least an idea about which harness the automation error is about? It could
> be something like: "Automation Error: mozprocess timed out after 1000
> seconds running reftest: %command%."

Yes, that would be much clearer IMO.
Flags: needinfo?(catlee)
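A minimal sketch of what the two suggestions above could look like (relative paths in the logged command plus the harness name in the message). The helper name and arguments are hypothetical; this is not an actual mozharness patch.

import ntpath  # Windows-style path handling so the example paths behave the same on any platform

def format_automation_error(cmd, suite, timeout, build_dir):
    # Hypothetical helper: rewrite absolute paths relative to the build dir
    # so the logged line is stable across tasks, and name the harness.
    rel_cmd = [ntpath.relpath(part, build_dir) if ntpath.isabs(part) else part
               for part in cmd]
    return ("Automation Error: mozprocess timed out after %d seconds "
            "running %s: %s" % (timeout, suite, rel_cmd))

cmd = [r"Z:\task_1517946424\build\venv\Scripts\python", "-u",
       r"Z:\task_1517946424\build\tests\reftest\runreftest.py",
       "--suite=crashtest"]
print(format_automation_error(cmd, "reftest", 1000, r"Z:\task_1517946424\build"))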
As OF shows, these days we only see those reftest hangs for QuantumRender: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-04-05&endday=2018-04-12&tree=trunk&bug=1436237

It also happens after Marionette has installed the extensions (specialpowers, reftest). So I would assume that we are seeing something similar to bug 1452913, where we might not handle a failure appropriately and never call `DoneTests()`. As a result we hang forever.

Geoff, the reftest extension can log actions to a file when --log-file is used when invoking reftests, but I don't see that happening, or the artifact isn't uploaded. Could you check that and, if necessary, get it added? It will help us identify the last step the extension did, and which step might have failed afterward.
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #28)
> Geoff, the reftest extension can log actions to a file when --log-file is
> used when invoking reftests, but I don't see it happening, or the artifact
> not uploaded. Could you check that and if necessary get it added? It will
> help us to identify the last step the extension did, and which step might
> have failed afterward.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fda489ab3fd6265850b224fb2c73de3c1b546693

Is this producing the log you are interested in, the "reftest-log.log" artifact? It looks to me like all of that information is already included in the normal raw log artifact (eg "crashtest_raw.log" for crashtests).
Flags: needinfo?(gbrown)
Oh, you are right. I totally missed that. Sorry. Also, with your patch we get quite large logs, ~17MB per test job, due to debug level logging. I think that we are fine and don't need anything extra.

So I have taken another look at the logs and noticed the following. For a successful build the following reftest extension log output is seen right after: https://taskcluster-artifacts.net/SjXrJ914SDmH3bXODrKvzg/0/public/logs/log_info.log

> 02:49:53 INFO - REFTEST INFO | Reading manifest file://C:\Users\task_1523587360\build\tests\reftest\tests\layout\reftests\reftest.list
> 02:49:53 INFO - REFTEST INFO | Dumping JSON representation of sandbox

Note that this is NOT happening in the case of hangs! So it feels to me that the reftest extension might fail early during initialization/startup, maybe something similar to what I fixed yesterday on bug 1452913. Here is where it is called: https://dxr.mozilla.org/mozilla-central/rev/325ef357e5b73d63794e47c02c7f8e7cf58ccb48/layout/tools/reftest/reftest.jsm#391

The code lives in `ReadTests`, which gets called at: https://dxr.mozilla.org/mozilla-central/rev/325ef357e5b73d63794e47c02c7f8e7cf58ccb48/layout/tools/reftest/reftest.jsm#306-312

Interesting is the check for `FOCUS_FILTER_NON_NEEDS_FOCUS_TESTS`. If `focusFilterMode` is different, we register a listener for focus, which then calls `ReadTests`. Please note that we would never reach `ReadTests` if the browser doesn't emit this event.

The `focusFilterMode` is set via the preference `reftest.focusFilterMode` in the reftest harness, which comes directly from the command line options: https://dxr.mozilla.org/mozilla-central/rev/325ef357e5b73d63794e47c02c7f8e7cf58ccb48/layout/tools/reftest/runreftest.py#304

And we don't set this option in mozharness: https://dxr.mozilla.org/mozilla-central/rev/325ef357e5b73d63794e47c02c7f8e7cf58ccb48/testing/mozharness/configs/unittests/win_taskcluster_unittest.py#121-135

As such it is None, and I'm fairly sure that we hang due to the scenario described above: Firefox does not have focus.

Geoff, I would suggest that we add another `logger.info()` call which explains that we are waiting for focus. I will file a new bug and provide a patch.
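To make the gating described above easier to follow, here is a simplified Python sketch of the control flow (the real code is JavaScript in reftest.jsm; names and the constant's value below are paraphrased for illustration, only the identifiers cited above are real):

FOCUS_FILTER_NON_NEEDS_FOCUS_TESTS = "non-needs-focus"  # value assumed for illustration

def on_reftest_startup(focus_filter_mode, add_focus_listener, read_tests):
    # Simplified paraphrase of the startup path described above.
    if focus_filter_mode == FOCUS_FILTER_NON_NEEDS_FOCUS_TESTS:
        # Only tests that don't need focus: read the manifests right away.
        read_tests()
    else:
        # Otherwise ReadTests only runs once the window gets focus. If the
        # focus event never arrives, the harness sits silent until mozprocess
        # times out after 1000 seconds.
        add_focus_listener(read_tests)

# Simulating the hang: a focus listener is registered but never fired,
# so "Reading manifest" is never printed.
pending = []
on_reftest_startup(None, pending.append, lambda: print("REFTEST INFO | Reading manifest"))
print("waiting for focus; listeners registered:", len(pending))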
Btw, the mochitest harness seems to log some more helpful messages in case the focus cannot be set:

> 0:39.87 INFO Error: Unable to restore focus, expect failures and timeouts.
> [..]
> 0:39.93 INFO must wait for load
> 0:39.93 INFO must wait for focus

Geoff, it looks like we might want to do some more work, which I don't have the time for. Maybe you could pick this up?
Flags: needinfo?(gbrown)
I actually filed bug 1453895 about that.
Depends on: 1453895
Flags: needinfo?(gbrown)
There are 41 failures in the last 7 days. They occur on:
* 6 windows7-32 (pgo, debug)
* 17 windows10-64-qr (debug, opt)
* 2 windows10-64-ccov (debug)
* 15 windows10-64 (debug, opt, pgo)
* 1 osx-10-10 (debug)

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=207451732&repo=autoland&lineNumber=1210

ahal: Hi, can you please point to someone that can have a look at this? Thank you!
Flags: needinfo?(ahal)
Whiteboard: [stockwell unknown] → [stockwell unknown][stockwell needswork]
Note that this became much more frequent around Oct 23. A more detailed regression window might be helpful.
Noting that no changes to the reftest harness or mozprocess landed between then and now.
Flags: needinfo?(ahal)
These are primarily Windows 10, and this is on hardware, not on a VM.

:pmoore - when did the generic-worker upgrade roll out to the Windows hardware?
:markco - were there other changes to the Windows hardware such as firmware upgrades, OS tweaks, etc.?

Specifically in the last few days?
Flags: needinfo?(pmoore)
Flags: needinfo?(mcornmesser)
Please note that in all the cases I checked, Marionette was able to install the reftest extension. So I assume that it's not getting correctly initialized and the tests are never started.
Also, if it were a configuration change for the Windows hardware, I would expect to see failures on other branches too, not only on trunk: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-10-21&endday=2018-10-24&tree=all&page=1&page_size=100&bug=1436237

As this graph shows, we had no failures before Oct 23rd. As such, maybe one of the landed commits is related? https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2018-10-22&enddate=2018-10-24

I would suggest that we do a backfill.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #54)
> these are primarily windows 10, this is on hardware, not on vm.
>
> :pmoore- when did the generic-worker upgrade roll out to the windows
> hardware?
> :markco- were there other changes to windows hardware such as firmware
> upgrades, os tweaks, etc?
>
> specifically in the last few days/

Oct 15th was the last generic-worker upgrade, and Oct 18th was the last significant change to the OS configuration.

I am not aware of any changes to the firmware within the last few days. Dhouse?
Flags: needinfo?(pmoore)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Also note there is no Z: drive on Windows hardware nodes. I don't know if that is playing into the issue or not.
(In reply to Mark Cornmesser [:markco] from comment #57)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #54)
> > these are primarily windows 10, this is on hardware, not on vm.
> >
> > :pmoore- when did the generic-worker upgrade roll out to the windows
> > hardware?
> > :markco- were there other changes to windows hardware such as firmware
> > upgrades, os tweaks, etc?
> >
> > specifically in the last few days/
>
> Oct 15th was the last generic-worker upgrade, and Oct 18th was the last
> significant change to the OS configuration.
>
> I am not aware of any changes to the firmware within the last few days.
> Dhouse?

No firmware changes or other changes to the hardware have been made within the last week.
Flags: needinfo?(dhouse)
(In reply to Henrik Skupin (:whimboo) from comment #56)
> https://treeherder.mozilla.org/intermittent-failures.html#/
> bugdetails?startday=2018-10-21&endday=2018-10-
> 24&tree=all&page=1&page_size=100&bug=1436237
>
> As this graph shows we had no failures before Oct 23rd. As such maybe one of
> those landed commits are related?

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-09-24&endday=2018-10-24&tree=trunk&page=1&page_size=20&bug=1436237

This looks like it started on Oct. 17th and didn't show up over the weekend.
There are 222 failures in the last 7 days. They occur on:
* 112 windows10-64-qr (opt, debug)
* 99 windows10-64 (all build types)
* 6 windows10-64-ccov (debug)
* 2 windows10-64-ccov (debug)
* 2 windows10-64-devedition (opt)
* 1 windows10-64-nightly (opt)

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=209156247&repo=autoland&lineNumber=1003
I'm trying to narrow this down with debug logging on try. So far it looks like something is hanging in reftest.jsm.
Flags: needinfo?(gbrown)
This failure is easily reproduced on try, at least on windows-10/debug: https://treeherder.mozilla.org/#/jobs?repo=try&revision=31bb93611267c85883d29a0740cbfb75fe5cd32a

Debug logging in that try push shows that this code is run in both the success and failure cases: https://dxr.mozilla.org/mozilla-central/rev/c291143e24019097d087f9307e59b49facaf90cb/layout/tools/reftest/reftest.jsm#308-309

but ReadTests() is never called in the failure cases: reftests do not start because the focus event is never received.

We identified this as an intermittent cause of this failure months ago, but didn't follow up effectively: bug 1453895.
Flags: needinfo?(gbrown)
Thanks for looking into this Geoff!
Failures have become less frequent lately; I don't know why. Most/all of the remaining failures should go away with bug 1453895's landing (today).
The one remaining failure filed after the patch on bug 1453895 landed is a crash due to a shutdown hang, and is not related to this bug at all: https://treeherder.mozilla.org/logviewer.html#?job_id=210599757&repo=autoland&lineNumber=29670-29678

I think it's safe to close this bug as fixed now.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED