Open Bug 1572900 Opened 10 months ago Updated 21 hours ago

Intermittent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477

Categories

(Testing :: Reftest, defect, P5)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell needswork:owner])

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=260889161&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/WsBv_DT8QC2FPQ0fZJ0hoA/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/WsBv_DT8QC2FPQ0fZJ0hoA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2019-08-10T00:36:46.503Z] 00:36:46 INFO - 1565397406500 Marionette DEBUG Init aborted (running=false, enabled=true, finalUIStartup=false)
[task 2019-08-10T00:36:46.650Z] 00:36:46 INFO - 1565397406645 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-08-10T00:42:56.656Z] 00:42:56 ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output
[task 2019-08-10T00:42:56.656Z] 00:42:56 ERROR - REFTEST ERROR | Force-terminating active process(es).
[task 2019-08-10T00:42:56.659Z] 00:42:56 INFO - REFTEST TEST-INFO | started process screenshot
[task 2019-08-10T00:42:56.726Z] 00:42:56 INFO - REFTEST TEST-INFO | screenshot: exit 0
[task 2019-08-10T00:42:56.741Z] 00:42:56 INFO - TEST-INFO | crashinject: exit 0
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - WARNING | IO Completion Port failed to signal process shutdown
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - Parent process 5572 exited with children alive:
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - PIDS: 3044
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - Attempting to kill them, but no guarantee of success
[task 2019-08-10T00:46:01.896Z] 00:46:01 ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 3221225477
[task 2019-08-10T00:46:01.915Z] 00:46:01 INFO - Traceback (most recent call last):
[task 2019-08-10T00:46:01.915Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 993, in <module>
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - sys.exit(run_test_harness(parser, options))
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 985, in run_test_harness
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - result = reftest.runTests(options.tests, options)
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 585, in runTests
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - return self.runSerialTests(manifests, options, cmdargs)
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 933, in runSerialTests
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - status = run(tests=tests)
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 910, in run
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - **kwargs)
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 809, in runApp
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - marionette.start_session()
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - m._handle_socket_failure()
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 650, in _handle_socket_failure
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - reraise(exc, val, tb)
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - return func(*args, **kwargs)
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1108, in start_session
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - self.raise_for_port(timeout=timeout)
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 569, in raise_for_port
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - self.host, self.port))
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!
[task 2019-08-10T00:46:01.974Z] 00:46:01 ERROR - Return code: 1
[task 2019-08-10T00:46:01.975Z] 00:46:01 ERROR - No suite end message was emitted by this harness.
[task 2019-08-10T00:46:01.975Z] 00:46:01 INFO - TinderboxPrint: reftest-reftest<br/>4978/0/61

The spike in failure rate caused by the landing of bug 1611660 (specific to windows7-32-mingwclang) was resolved later the same day by the backout of bug 1611660.

We will make sure the issue causing the spike is resolved before re-landing bug 1611660. The existing low intermittent failure rate across various platforms is an unrelated issue.

Flags: needinfo?(botond)
Whiteboard: [stockwell needswork:owner]

I have not been looking at this.

I did spend some time investigating the much larger spike that was caused by the landing of bug 1611660 and went away with its backout (see bug 1611660 comment 31 through bug 1611660 comment 35, though that investigation is also stalled at the moment), but that spike is specific to MinGW builds and may well be unrelated to the lower-frequency intermittents that we're seeing independently of bug 1611660.

Flags: needinfo?(botond)

In the last 7 days there have been 26 occurrences on windows10-64 opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297260460&repo=mozilla-central&lineNumber=3905

There have been 34 total failures in the last 7 days; affected platforms are windows10-aarch64 opt, windows10-64-shippable opt, and windows10-64-ccov opt.
Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298408304&repo=mozilla-central&lineNumber=3977

Kats, could you please redirect this to someone who can take a look?

Flags: needinfo?(kats)

:ahal, do you have an idea what might be happening here? Looking at the log linked in the previous comment, I see a gap of 370 seconds here:

[task 2020-04-20T14:35:47.277Z] 14:35:47     INFO - REFTEST INFO | Application command: Z:\task_1587392174\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\task_1587392174\appdata\local\temp\tmp0mz0wm.mozrunner
[task 2020-04-20T14:41:57.290Z] 14:41:57    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output

which seems to imply that running the "application command" somehow failed to get firefox actually running. The log lines immediately after that seem to be triggered by the reftest harness attempting to force-terminate the process, and then there's a python stack dump which looks like marionette failing to connect to the firefox instance (which seems plausible since it never really got off the ground). So it seems to me like the firefox process either didn't get started at all, indicating a problem in whatever code is invoking that "application command", or the firefox process crashed immediately after startup. There's this additional logging:

[task 2020-04-20T14:45:02.444Z] 14:45:02     INFO -  Parent process 4644 exited with children alive:
[task 2020-04-20T14:45:02.444Z] 14:45:02     INFO -  PIDS: 3888

which seems to indicate the latter scenario - firefox was launched but terminated early. Does that seem like a reasonable interpretation of events here? If so, is there any way to get additional logging from the firefox instance that terminated, or get the exit code, minidump, or any sort of information that might shed light on why it terminated?

Flags: needinfo?(kats) → needinfo?(ahal)

Based on the symptoms, it's possible that what's happening here is the same thing that was happening at an increased frequency with the bug 1611660 patches.

I haven't gotten to the bottom of that (and I'm at a bit of a loss for how to get further), but the notes documenting my investigation in bug 1611660 comment 31 through bug 1611660 comment 35 may be helpful.

Thanks. If nothing else, I can do try pushes with those patches to trigger the error more often and debug it that way. But that's really an option of last resort.

Yeah, your assessment looks pretty accurate. Basically we're:

  1. Running the process:
    https://searchfox.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/base/runner.py#131

  2. The reftest harness captures output from the process:
    https://searchfox.org/mozilla-central/source/layout/tools/reftest/runreftest.py#798

  3. No output is detected for 370 seconds, triggering this function:
    https://searchfox.org/mozilla-central/source/layout/tools/reftest/runreftest.py#700
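
To make the flow concrete, here's a minimal, self-contained sketch of that "no output for N seconds" pattern. This is not the actual mozrunner/mozprocess code (see the searchfox links above for that); the command, names, and handler below are illustrative only.

import subprocess
import sys
import threading
import time

OUTPUT_TIMEOUT = 370  # seconds of silence before we declare a hang (same value as the harness)

def run_with_output_timeout(cmd):
    """Run cmd, echo its output, and force-terminate it if it produces no
    output for OUTPUT_TIMEOUT seconds (roughly what the harness's
    timeoutHandler ends up doing)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.time()]

    def reader():
        # step 2: capture and log every line the application prints
        for line in proc.stdout:
            last_output[0] = time.time()
            sys.stdout.write(line)

    threading.Thread(target=reader, daemon=True).start()

    # step 3: watch for a long gap in output and treat it as a hang
    while proc.poll() is None:
        if time.time() - last_output[0] > OUTPUT_TIMEOUT:
            print("REFTEST ERROR | application timed out after %d seconds "
                  "with no output" % OUTPUT_TIMEOUT)
            proc.kill()  # the real harness also takes a screenshot and crash-injects first
            break
        time.sleep(1)
    return proc.wait()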

Any gecko output the Python side of the test harness receives should already be dumped to the log (see further up in the same log for some examples). Maybe it's possible to set MOZ_LOG to something more verbose? I wouldn't know how to do that though.
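
If someone wants to try the MOZ_LOG idea: Gecko does read the MOZ_LOG / MOZ_LOG_FILE environment variables at startup, so one untested option would be to inject them into whatever environment the harness hands to the Firefox process. A rough sketch; the helper name, hook point, and module list are guesses rather than anything the harness actually exposes:

import os

def with_gecko_logging(browser_env, log_dir):
    """Hypothetical helper: return a copy of the browser environment with
    verbose Gecko logging enabled via MOZ_LOG/MOZ_LOG_FILE."""
    env = dict(browser_env)
    # "timestamp" is a MOZ_LOG option; the module name is a placeholder and
    # level 5 is the most verbose.
    env["MOZ_LOG"] = "timestamp,example_module:5"
    # Without MOZ_LOG_FILE the logging goes to stderr, which the harness
    # already captures; a separate file avoids interleaving with test output.
    env["MOZ_LOG_FILE"] = os.path.join(log_dir, "gecko.log")
    return env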

Without more logging, it's very hard to tell where we're going wrong. We could try dumping proc.output in the timeoutHandler. Though I can't think of a reason that would contain anything that isn't in the main log.
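
Roughly, something like this (assuming the timeoutHandler can reach the runner's process handler and that it buffers captured lines in an output list; the names are illustrative, not the actual harness API):

def timeoutHandler(proc, log):
    """Sketch: replay whatever output the process handler buffered, in case
    something was captured after the last line we logged."""
    buffered = getattr(proc, "output", None) or []
    log.error("application timed out; dumping %d buffered lines" % len(buffered))
    for line in buffered:
        log.info("buffered> %s" % line)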

Flags: needinfo?(ahal)

Permafailing is good; I'll do some try pushes to try to debug this. The most recent logs are slightly different from the one we examined earlier: in the new ones we see a few lines of marionette output before the 370 seconds of radio silence, e.g.:

[task 2020-04-30T00:30:44.493Z] 00:30:44     INFO - REFTEST INFO | Application command: C:\tasks\task_1588189014\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\testdroid.yoga-029.000\appdata\local\temp\tmp_hzjpb.mozrunner
[task 2020-04-30T00:30:45.238Z] 00:30:45     INFO - 1588206645226	Marionette	TRACE	Marionette enabled
[task 2020-04-30T00:30:45.303Z] 00:30:45     INFO - 1588206645296	Marionette	TRACE	Received observer notification toplevel-window-ready
[task 2020-04-30T00:36:55.314Z] 00:36:55    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output

Sadly, try pushes for aarch64 reftests take a long time to run since they have to do some PGO stuff, so I can only do a couple of iterations a day. My current attempt is blowing up with my patch to turn on all MOZ_LOG calls. https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=df4b5c534400b5bfdc083c59ddf7064621d2031c

Will have to try something more targeted.

Quick update: I did a try push where I emit only the first 200 MOZ_LOGs of any module/level, in an attempt to see what the "last log before hang" was. 200 is too small though, since firefox emits the 200 logs and keeps going. I upped it to 1000 and that seemed a bit better, in that I'm seeing fewer than 1000 lines of output per process in the logfile. So I can dig in a bit more here and try to figure out the last thing that ran.

1000 wasn't actually enough. I was misled by MOZ_LOG_TEST getting called more times than MOZ_LOG. I will keep incrementing and doing try pushes to see if I can get the last log lines before the hang.

In the last 7 days there have been 41 occurrences, most on windows10-aarch64, windows10-64-ccov, and linux1804-64-ccov opt build types.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302611063&repo=mozilla-central&lineNumber=10994

[task 2020-05-16T23:46:30.286Z] 23:46:30     INFO - REFTEST TEST-START | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html
[task 2020-05-16T23:46:30.287Z] 23:46:30     INFO - REFTEST TEST-LOAD | file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/ps2n2c16.png | 4 / 5 (80%)
[task 2020-05-16T23:46:30.348Z] 23:46:30     INFO - REFTEST TEST-LOAD | file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/ps2n2c16.html | 4 / 5 (80%)
[task 2020-05-16T23:46:30.439Z] 23:46:30     INFO - REFTEST TEST-PASS | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2020-05-16T23:46:30.440Z] 23:46:30     INFO - REFTEST TEST-END | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html
[task 2020-05-16T23:46:30.458Z] 23:46:30     INFO - REFTEST INFO | Slowest test took 114ms (file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/pp0n2c16.html)
[task 2020-05-16T23:46:30.459Z] 23:46:30     INFO - REFTEST INFO | Total canvas count = 2
[task 2020-05-16T23:46:30.658Z] 23:46:30     INFO - [Parent 1064, Gecko_IOThread] WARNING: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_win.cc, line 166
[task 2020-05-16T23:46:30.664Z] 23:46:30     INFO - JavaScript error: resource://gre/modules/UpdateTimerManager.jsm, line 352: TypeError: setting getter-only property "gLogEnabled"
[task 2020-05-16T23:46:30.665Z] 23:46:30     INFO - JavaScript error: resource://normandy/lib/RecipeRunner.jsm, line 290: NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS: [JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 352}]'[JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 352}]' when calling method: [nsIUpdateTimerManager::registerTimer]
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790840	Marionette	TRACE	Received observer notification xpcom-will-shutdown
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790841	Marionette	INFO	Stopped listening on port 2828
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790841	Marionette	DEBUG	Marionette stopped listening
[task 2020-05-16T23:46:30.908Z] 23:46:30     INFO - ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-05-16T23:46:30.933Z] 23:46:30     INFO - ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
[task 2020-05-16T23:46:31.140Z] 23:46:31     INFO - REFTEST INFO | Process mode: e10s
[task 2020-05-16T23:46:31.141Z] 23:46:31  WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
[task 2020-05-16T23:46:31.142Z] 23:46:31     INFO - REFTEST INFO | Running tests in file:///C:/tasks/task_1589554637/build/tests/reftest/tests/layout/reftests/forms/input/hidden/reftest.list
[task 2020-05-16T23:46:31.225Z] 23:46:31     INFO - REFTEST INFO | Running with e10s: True
[task 2020-05-16T23:46:31.225Z] 23:46:31     INFO - REFTEST INFO | Application command: C:\tasks\task_1589554637\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\testdr~1\appdata\local\temp\tmpzih_kx.mozrunner
[task 2020-05-16T23:46:31.954Z] 23:46:31     INFO - 1589672791946	Marionette	TRACE	Marionette enabled
[task 2020-05-16T23:46:32.024Z] 23:46:32     INFO - 1589672792017	Marionette	TRACE	Received observer notification toplevel-window-ready
[task 2020-05-16T23:52:42.029Z] 23:52:42    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output
[task 2020-05-16T23:52:42.032Z] 23:52:42    ERROR - REFTEST ERROR | Force-terminating active process(es).
[task 2020-05-16T23:52:42.033Z] 23:52:42     INFO - REFTEST TEST-INFO | started process screenshot
[task 2020-05-16T23:52:42.278Z] 23:52:42     INFO - REFTEST TEST-INFO | screenshot: exit 0
[task 2020-05-16T23:52:42.300Z] 23:52:42     INFO -  TEST-INFO | crashinject: exit 0
[task 2020-05-16T23:55:47.357Z] 23:55:47     INFO -  WARNING | IO Completion Port failed to signal process shutdown
[task 2020-05-16T23:55:47.358Z] 23:55:47     INFO -  Parent process 6840 exited with children alive:
[task 2020-05-16T23:55:47.359Z] 23:55:47     INFO -  PIDS: 2808, 7444
[task 2020-05-16T23:55:47.360Z] 23:55:47     INFO -  Attempting to kill them, but no guarantee of success
[task 2020-05-16T23:55:47.420Z] 23:55:47    ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 3221225477
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -  Traceback (most recent call last):
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 1038, in <module>
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -      sys.exit(run_test_harness(parser, options))
[task 2020-05-16T23:55:47.439Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 1030, in run_test_harness
[task 2020-05-16T23:55:47.441Z] 23:55:47     INFO -      result = reftest.runTests(options.tests, options)
[task 2020-05-16T23:55:47.441Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 621, in runTests
[task 2020-05-16T23:55:47.446Z] 23:55:47     INFO -      return self.runSerialTests(manifests, options, cmdargs)
[task 2020-05-16T23:55:47.446Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 978, in runSerialTests
[task 2020-05-16T23:55:47.451Z] 23:55:47     INFO -      status = run(tests=tests)
[task 2020-05-16T23:55:47.451Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 952, in run
[task 2020-05-16T23:55:47.455Z] 23:55:47     INFO -      **kwargs)
[task 2020-05-16T23:55:47.455Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 884, in runApp
[task 2020-05-16T23:55:47.459Z] 23:55:47     INFO -      raise reraise(exc, value, tb)
[task 2020-05-16T23:55:47.460Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 847, in runApp
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -      marionette.start_session()
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -      m._handle_socket_failure()
[task 2020-05-16T23:55:47.465Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 653, in _handle_socket_failure
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -      reraise(exc_cls, exc, tb)
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -      return func(*args, **kwargs)
[task 2020-05-16T23:55:47.468Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1111, in start_session
[task 2020-05-16T23:55:47.472Z] 23:55:47     INFO -      self.raise_for_port(timeout=timeout)
[task 2020-05-16T23:55:47.472Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 572, in raise_for_port
[task 2020-05-16T23:55:47.474Z] 23:55:47     INFO -      self.host, self.port))
[task 2020-05-16T23:55:47.474Z] 23:55:47     INFO -  socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!
[task 2020-05-16T23:55:47.646Z] 23:55:47    ERROR - Return code: 1
[task 2020-05-16T23:55:47.648Z] 23:55:47    ERROR - No suite end message was emitted by this harness.
[task 2020-05-16T23:55:47.648Z] 23:55:47     INFO - TinderboxPrint: reftest-reftest<br/>3058/0/13
[task 2020-05-16T23:55:47.648Z] 23:55:47    ERROR - # TBPL FAILURE #
[task 2020-05-16T23:55:47.649Z] 23:55:47  WARNING - setting return code to 2
[task 2020-05-16T23:55:47.649Z] 23:55:47    ERROR - The reftest suite: reftest ran with return status: FAILURE

This bug failed 38 times in the last 7 days. It occurs on windows10-aarch64 opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303563822&repo=mozilla-central&lineNumber=28705

Kats: Any updates on this bug?

Flags: needinfo?(kats)

Unfortunately no. The approach I was trying wasn't really working out, because the MOZ_LOG spew was too much and too nondeterministic, so I couldn't really compare from one run to the next and get a sense of where in firefox startup we were. I wasn't really sure how to proceed further, but I think maybe pursuing botond's "bisecting the hang" approach from this comment might be an avenue to explore. I'll try that next.

Flags: needinfo?(kats)

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303756656&repo=try&lineNumber=10069 is a job from my most recent try push. Comparing a successful vs failed run shows that the nsWindowWatcher is creating the windows for chrome://gfxsanity/content/sanityparent.html and the chrome://browser/content/browser.xhtml documents, but it doesn't get as far as nsContentDLF::CreateDocument for those documents (whereas it does in the successful case). I'll continue bisecting in between those two points.
