Closed Bug 1572900 Opened 2 years ago Closed 27 days ago

Intermittent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477

Categories

(Testing :: Reftest, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 3 open bugs, Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [comment 113][stockwell unknown])

Attachments

(1 file)

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=260889161&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/WsBv_DT8QC2FPQ0fZJ0hoA/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/WsBv_DT8QC2FPQ0fZJ0hoA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2019-08-10T00:36:46.503Z] 00:36:46 INFO - 1565397406500 Marionette DEBUG Init aborted (running=false, enabled=true, finalUIStartup=false)
[task 2019-08-10T00:36:46.650Z] 00:36:46 INFO - 1565397406645 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-08-10T00:42:56.656Z] 00:42:56 ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output
[task 2019-08-10T00:42:56.656Z] 00:42:56 ERROR - REFTEST ERROR | Force-terminating active process(es).
[task 2019-08-10T00:42:56.659Z] 00:42:56 INFO - REFTEST TEST-INFO | started process screenshot
[task 2019-08-10T00:42:56.726Z] 00:42:56 INFO - REFTEST TEST-INFO | screenshot: exit 0
[task 2019-08-10T00:42:56.741Z] 00:42:56 INFO - TEST-INFO | crashinject: exit 0
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - WARNING | IO Completion Port failed to signal process shutdown
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - Parent process 5572 exited with children alive:
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - PIDS: 3044
[task 2019-08-10T00:46:01.741Z] 00:46:01 INFO - Attempting to kill them, but no guarantee of success
[task 2019-08-10T00:46:01.896Z] 00:46:01 ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 3221225477
[task 2019-08-10T00:46:01.915Z] 00:46:01 INFO - Traceback (most recent call last):
[task 2019-08-10T00:46:01.915Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 993, in <module>
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - sys.exit(run_test_harness(parser, options))
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 985, in run_test_harness
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - result = reftest.runTests(options.tests, options)
[task 2019-08-10T00:46:01.918Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 585, in runTests
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - return self.runSerialTests(manifests, options, cmdargs)
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 933, in runSerialTests
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - status = run(tests=tests)
[task 2019-08-10T00:46:01.921Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 910, in run
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - **kwargs)
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - File "Z:\task_1565391448\build\tests\reftest\runreftest.py", line 809, in runApp
[task 2019-08-10T00:46:01.924Z] 00:46:01 INFO - marionette.start_session()
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - m._handle_socket_failure()
[task 2019-08-10T00:46:01.925Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 650, in _handle_socket_failure
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - reraise(exc, val, tb)
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - return func(*args, **kwargs)
[task 2019-08-10T00:46:01.926Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1108, in start_session
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - self.raise_for_port(timeout=timeout)
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - File "Z:\task_1565391448\build\venv\lib\site-packages\marionette_driver\marionette.py", line 569, in raise_for_port
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - self.host, self.port))
[task 2019-08-10T00:46:01.929Z] 00:46:01 INFO - socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!
[task 2019-08-10T00:46:01.974Z] 00:46:01 ERROR - Return code: 1
[task 2019-08-10T00:46:01.975Z] 00:46:01 ERROR - No suite end message was emitted by this harness.
[task 2019-08-10T00:46:01.975Z] 00:46:01 INFO - TinderboxPrint: reftest-reftest<br/>4978/0/61

The spike in failure rate caused by the landing of bug 1611660 (specific to windows7-32-mingwclang) was resolved later the same day by the backout of bug 1611660.

We will make sure the issue causing the spike is resolved before re-landing bug 1611660. The existing low intermittent failure rate across various platforms is an unrelated issue.

Flags: needinfo?(botond)

I have not been looking at this.

I did spend some time investigating the much larger spike that was caused by the landing of bug 1611660, and went away with its backout (see bug 1611660 comment 31 through bug 1611660 comment 35, though that investigation is also stalled at the moment), but that spike is specific to MinGW builds and may well be unrelated to the lower frequency intermittents that we're seeing independently of bug 1611660.

Flags: needinfo?(botond)

In the last 7 days there have been 26 occurrences on windows10-64 opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297260460&repo=mozilla-central&lineNumber=3905

There have been 34 total failures in the last 7 days; affected platforms are windows10-aarch64 opt, windows10-64-shippable opt, and windows10-64-ccov opt.
Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298408304&repo=mozilla-central&lineNumber=3977

Kats, could you please redirect this to someone who can take a look?

Flags: needinfo?(kats)

:ahal, do you have an idea what might be happening here? Looking at the log linked in the previous comment, I see a gap of 370 seconds here:

[task 2020-04-20T14:35:47.277Z] 14:35:47     INFO - REFTEST INFO | Application command: Z:\task_1587392174\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\task_1587392174\appdata\local\temp\tmp0mz0wm.mozrunner
[task 2020-04-20T14:41:57.290Z] 14:41:57    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output

which seems to imply that running the "application command" somehow failed to get firefox actually running. The log lines immediately after that seem to be triggered by the reftest harness attempting to force-terminate the process, and then there's a python stack dump which looks like marionette failing to connect to the firefox instance (which seems plausible since it never really got off the ground). So it seems to me like the firefox process either didn't get started at all, indicating a problem in whatever code is invoking that "application command", or the firefox process crashed immediately after startup. There's this additional logging:

[task 2020-04-20T14:45:02.444Z] 14:45:02     INFO -  Parent process 4644 exited with children alive:
[task 2020-04-20T14:45:02.444Z] 14:45:02     INFO -  PIDS: 3888

which seems to indicate the latter scenario - firefox was launched but terminated early. Does that seem like a reasonable interpretation of events here? If so, is there any way to get additional logging from the firefox instance that terminated, or get the exit code, minidump, or any sort of information that might shed light on why it terminated?
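(Side note: the exit code in the failure message supports the crash theory. Interpreting 3221225477 as an unsigned 32-bit NTSTATUS gives 0xC0000005, which is STATUS_ACCESS_VIOLATION, i.e. the process died on a memory fault rather than exiting cleanly. A quick check:

```python
# Decode the Windows exit code the harness reported.
exit_code = 3221225477

# NTSTATUS values are unsigned 32-bit; 0xC0000005 is
# STATUS_ACCESS_VIOLATION (a memory-access fault).
STATUS_ACCESS_VIOLATION = 0xC0000005

print(hex(exit_code))                        # 0xc0000005
print(exit_code == STATUS_ACCESS_VIOLATION)  # True
```

So whatever happens during startup, the parent process appears to be crashing, not hanging indefinitely.)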

Flags: needinfo?(kats) → needinfo?(ahal)

Based on the symptoms, it's possible that what's happening here is the same thing that was happening at an increased frequency with the bug 1611660 patches.

I haven't gotten to the bottom of that (and I'm at a bit of a loss for how to get further), but the notes documenting my investigation in bug 1611660 comment 31 through bug 1611660 comment 35 may be helpful.

Thanks. I guess if nothing else I can do try pushes with those patches to trigger the error more often and debug that way. But that's really an option of last resort.

Yeah, your assessment looks pretty accurate. Basically we're:

  1. Running the process:
    https://searchfox.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/base/runner.py#131

  2. The reftest harness captures output from the process:
    https://searchfox.org/mozilla-central/source/layout/tools/reftest/runreftest.py#798

  3. No output is detected for 370 seconds, triggering this function:
    https://searchfox.org/mozilla-central/source/layout/tools/reftest/runreftest.py#700
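Those three steps can be sketched, very roughly, with plain subprocess and threading. This is a simplified stand-in for what the harness actually does, not its real implementation; the 370-second value matches the harness setting, but the function and callback names here are made up for illustration:

```python
import subprocess
import threading

OUTPUT_TIMEOUT = 370  # seconds; matches the harness's no-output timeout

def run_with_output_watchdog(cmd, on_timeout):
    """Run cmd, echo its output, and call on_timeout(proc) if the
    process produces no output for OUTPUT_TIMEOUT seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    timer = None

    def arm():
        # (Re)start the no-output timer; any output line resets it.
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(OUTPUT_TIMEOUT, on_timeout, args=(proc,))
        timer.daemon = True
        timer.start()

    arm()
    for line in proc.stdout:
        print(line.decode("utf-8", "replace"), end="")
        arm()
    timer.cancel()
    return proc.wait()
```

The key point is that the timer only fires when the process is completely silent; any line of gecko output resets it, which is why a 370-second gap in the log is the signature of this failure.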

Any gecko output the Python side of the test harness receives should already be dumped to the log (see further up in the same log for some examples). Maybe it's possible to set MOZ_LOG to something more verbose? I wouldn't know how to do that though.
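For reference, MOZ_LOG takes a comma-separated list of module:level pairs (level 5 being the most verbose), and MOZ_LOG_FILE redirects the output to a file; the harness would need to pass these through to the firefox process it launches. A sketch, with illustrative module names:

```python
import os

# MOZ_LOG syntax: comma-separated module:level pairs, level 5 = verbose.
# The module names below are illustrative, not a recommendation for this bug.
env = dict(os.environ)
env["MOZ_LOG"] = "nsHttp:5,Timestamp"
env["MOZ_LOG_FILE"] = "gecko_log.txt"  # write log here instead of stderr

# The harness would then launch firefox with this environment, e.g.:
# subprocess.Popen([firefox_path, "-marionette", ...], env=env)
```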

Without more logging, it's very hard to tell where we're going wrong. We could try dumping proc.output in the timeoutHandler, though I can't think of a reason it would contain anything that isn't in the main log.

Flags: needinfo?(ahal)

Permafailing is good; I'll do some try pushes to try to debug this. The most recent logs are slightly different from the one we examined earlier: in the new ones we see a few lines of marionette output before the 370 seconds of radio silence, e.g.:

[task 2020-04-30T00:30:44.493Z] 00:30:44     INFO - REFTEST INFO | Application command: C:\tasks\task_1588189014\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\testdroid.yoga-029.000\appdata\local\temp\tmp_hzjpb.mozrunner
[task 2020-04-30T00:30:45.238Z] 00:30:45     INFO - 1588206645226	Marionette	TRACE	Marionette enabled
[task 2020-04-30T00:30:45.303Z] 00:30:45     INFO - 1588206645296	Marionette	TRACE	Received observer notification toplevel-window-ready
[task 2020-04-30T00:36:55.314Z] 00:36:55    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output

Sadly, try pushes for aarch64 reftests take a long time to run since they have to do some PGO stuff, so I can only do a couple of iterations a day. My current attempt is blowing up with my patch to turn on all MOZ_LOG calls. https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=df4b5c534400b5bfdc083c59ddf7064621d2031c

Will have to try something more targeted.

Quick update: I did a try push where I emit only the first 200 MOZ_LOGs of any module/level, in an attempt to see what the "last log before hang" was. 200 is too small though, since firefox emits the 200 logs and keeps going. I upped it to 1000 and that seemed a bit better, in that I'm seeing fewer than 1000 lines of output per process in the logfile. So I can dig in a bit more here and try to figure out the last thing that ran.

1000 wasn't actually enough. I was misled by MOZ_LOG_TEST getting called more times than MOZ_LOG. I will keep incrementing and doing try pushes to see if I can get the last log lines before the hang.
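(The actual capping here was done with a Gecko-side patch; purely as an illustration of the technique, a Python analogue is a logging filter that passes only the first N records:

```python
import logging

class FirstNFilter(logging.Filter):
    """Pass only the first `limit` records; drop everything after.
    A Python analogue of the 'emit only the first N MOZ_LOGs' patch."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.count = 0

    def filter(self, record):
        self.count += 1
        return self.count <= self.limit
```

If the last emitted line is the same on every run, the hang is after that point; if it varies run to run, either the cap is too low or, as noted above, the logging itself is nondeterministic.)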

In the last 7 days there have been 41 occurrences, most on windows10-aarch64, windows10-64-ccov and linux1804-64-ccov build type opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302611063&repo=mozilla-central&lineNumber=10994

[task 2020-05-16T23:46:30.286Z] 23:46:30     INFO - REFTEST TEST-START | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html
[task 2020-05-16T23:46:30.287Z] 23:46:30     INFO - REFTEST TEST-LOAD | file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/ps2n2c16.png | 4 / 5 (80%)
[task 2020-05-16T23:46:30.348Z] 23:46:30     INFO - REFTEST TEST-LOAD | file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/ps2n2c16.html | 4 / 5 (80%)
[task 2020-05-16T23:46:30.439Z] 23:46:30     INFO - REFTEST TEST-PASS | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2020-05-16T23:46:30.440Z] 23:46:30     INFO - REFTEST TEST-END | image/test/reftest/pngsuite-palettes/ps2n2c16.png == image/test/reftest/pngsuite-palettes/ps2n2c16.html
[task 2020-05-16T23:46:30.458Z] 23:46:30     INFO - REFTEST INFO | Slowest test took 114ms (file:///C:/tasks/task_1589554637/build/tests/reftest/tests/image/test/reftest/pngsuite-palettes/pp0n2c16.html)
[task 2020-05-16T23:46:30.459Z] 23:46:30     INFO - REFTEST INFO | Total canvas count = 2
[task 2020-05-16T23:46:30.658Z] 23:46:30     INFO - [Parent 1064, Gecko_IOThread] WARNING: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_win.cc, line 166
[task 2020-05-16T23:46:30.664Z] 23:46:30     INFO - JavaScript error: resource://gre/modules/UpdateTimerManager.jsm, line 352: TypeError: setting getter-only property "gLogEnabled"
[task 2020-05-16T23:46:30.665Z] 23:46:30     INFO - JavaScript error: resource://normandy/lib/RecipeRunner.jsm, line 290: NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS: [JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 352}]'[JavaScript Error: "setting getter-only property "gLogEnabled"" {file: "resource://gre/modules/UpdateTimerManager.jsm" line: 352}]' when calling method: [nsIUpdateTimerManager::registerTimer]
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790840	Marionette	TRACE	Received observer notification xpcom-will-shutdown
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790841	Marionette	INFO	Stopped listening on port 2828
[task 2020-05-16T23:46:30.848Z] 23:46:30     INFO - 1589672790841	Marionette	DEBUG	Marionette stopped listening
[task 2020-05-16T23:46:30.908Z] 23:46:30     INFO - ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2020-05-16T23:46:30.933Z] 23:46:30     INFO - ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
[task 2020-05-16T23:46:31.140Z] 23:46:31     INFO - REFTEST INFO | Process mode: e10s
[task 2020-05-16T23:46:31.141Z] 23:46:31  WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
[task 2020-05-16T23:46:31.142Z] 23:46:31     INFO - REFTEST INFO | Running tests in file:///C:/tasks/task_1589554637/build/tests/reftest/tests/layout/reftests/forms/input/hidden/reftest.list
[task 2020-05-16T23:46:31.225Z] 23:46:31     INFO - REFTEST INFO | Running with e10s: True
[task 2020-05-16T23:46:31.225Z] 23:46:31     INFO - REFTEST INFO | Application command: C:\tasks\task_1589554637\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\testdr~1\appdata\local\temp\tmpzih_kx.mozrunner
[task 2020-05-16T23:46:31.954Z] 23:46:31     INFO - 1589672791946	Marionette	TRACE	Marionette enabled
[task 2020-05-16T23:46:32.024Z] 23:46:32     INFO - 1589672792017	Marionette	TRACE	Received observer notification toplevel-window-ready
[task 2020-05-16T23:52:42.029Z] 23:52:42    ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output
[task 2020-05-16T23:52:42.032Z] 23:52:42    ERROR - REFTEST ERROR | Force-terminating active process(es).
[task 2020-05-16T23:52:42.033Z] 23:52:42     INFO - REFTEST TEST-INFO | started process screenshot
[task 2020-05-16T23:52:42.278Z] 23:52:42     INFO - REFTEST TEST-INFO | screenshot: exit 0
[task 2020-05-16T23:52:42.300Z] 23:52:42     INFO -  TEST-INFO | crashinject: exit 0
[task 2020-05-16T23:55:47.357Z] 23:55:47     INFO -  WARNING | IO Completion Port failed to signal process shutdown
[task 2020-05-16T23:55:47.358Z] 23:55:47     INFO -  Parent process 6840 exited with children alive:
[task 2020-05-16T23:55:47.359Z] 23:55:47     INFO -  PIDS: 2808, 7444
[task 2020-05-16T23:55:47.360Z] 23:55:47     INFO -  Attempting to kill them, but no guarantee of success
[task 2020-05-16T23:55:47.420Z] 23:55:47    ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 3221225477
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -  Traceback (most recent call last):
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 1038, in <module>
[task 2020-05-16T23:55:47.438Z] 23:55:47     INFO -      sys.exit(run_test_harness(parser, options))
[task 2020-05-16T23:55:47.439Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 1030, in run_test_harness
[task 2020-05-16T23:55:47.441Z] 23:55:47     INFO -      result = reftest.runTests(options.tests, options)
[task 2020-05-16T23:55:47.441Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 621, in runTests
[task 2020-05-16T23:55:47.446Z] 23:55:47     INFO -      return self.runSerialTests(manifests, options, cmdargs)
[task 2020-05-16T23:55:47.446Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 978, in runSerialTests
[task 2020-05-16T23:55:47.451Z] 23:55:47     INFO -      status = run(tests=tests)
[task 2020-05-16T23:55:47.451Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 952, in run
[task 2020-05-16T23:55:47.455Z] 23:55:47     INFO -      **kwargs)
[task 2020-05-16T23:55:47.455Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 884, in runApp
[task 2020-05-16T23:55:47.459Z] 23:55:47     INFO -      raise reraise(exc, value, tb)
[task 2020-05-16T23:55:47.460Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\tests\reftest\runreftest.py", line 847, in runApp
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -      marionette.start_session()
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
[task 2020-05-16T23:55:47.464Z] 23:55:47     INFO -      m._handle_socket_failure()
[task 2020-05-16T23:55:47.465Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 653, in _handle_socket_failure
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -      reraise(exc_cls, exc, tb)
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
[task 2020-05-16T23:55:47.467Z] 23:55:47     INFO -      return func(*args, **kwargs)
[task 2020-05-16T23:55:47.468Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1111, in start_session
[task 2020-05-16T23:55:47.472Z] 23:55:47     INFO -      self.raise_for_port(timeout=timeout)
[task 2020-05-16T23:55:47.472Z] 23:55:47     INFO -    File "C:\tasks\task_1589554637\build\venv\lib\site-packages\marionette_driver\marionette.py", line 572, in raise_for_port
[task 2020-05-16T23:55:47.474Z] 23:55:47     INFO -      self.host, self.port))
[task 2020-05-16T23:55:47.474Z] 23:55:47     INFO -  socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!
[task 2020-05-16T23:55:47.646Z] 23:55:47    ERROR - Return code: 1
[task 2020-05-16T23:55:47.648Z] 23:55:47    ERROR - No suite end message was emitted by this harness.
[task 2020-05-16T23:55:47.648Z] 23:55:47     INFO - TinderboxPrint: reftest-reftest<br/>3058/0/13
[task 2020-05-16T23:55:47.648Z] 23:55:47    ERROR - # TBPL FAILURE #
[task 2020-05-16T23:55:47.649Z] 23:55:47  WARNING - setting return code to 2
[task 2020-05-16T23:55:47.649Z] 23:55:47    ERROR - The reftest suite: reftest ran with return status: FAILURE

This bug failed 38 times in the last 7 days. Occurs on windows10-aarch64 on opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303563822&repo=mozilla-central&lineNumber=28705

Kats: Any updates on this bug?

Flags: needinfo?(kats)

Unfortunately no. The approach I was trying wasn't really working out: the MOZ_LOG spew was too voluminous and too nondeterministic, so I couldn't compare one run to the next and get a sense of where in firefox startup we were. I wasn't sure how to proceed further, but I think pursuing botond's "bisecting the hang" approach from this comment might be an avenue to explore. I'll try that next.

Flags: needinfo?(kats)

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303756656&repo=try&lineNumber=10069 is a job from my most recent try push. Comparing a successful vs failed run shows that the nsWindowWatcher is creating the windows for chrome://gfxsanity/content/sanityparent.html and the chrome://browser/content/browser.xhtml documents, but it doesn't get as far as nsContentDLF::CreateDocument for those documents (whereas it does in the successful case). I'll continue bisecting in between those two points.

More try pushes show that we're not getting to the nsDocumentOpenInfo::OnStartRequest call for either the sanityparent.html or the browser.xhtml windows. Between the window watcher creating the windows and the OnStartRequest there's a big tangle of async I/O code, so I'm trying to trace through that now.

Latest try push shows that the document-loading code does make it to nsChromeProtocolHandler::NewChannel so the problem is likely in the async I/O code. I think that code has pretty good built-in logging so I'll try to find a way to enable that just during the startup portion of the run (so it doesn't overflow the logs) and see if I can get something useful there.

Your last push doesn't seem to be getting any failures, so I tried just botond's patch alone, which failed pretty easily in the past

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=3459048ab3196248b5ad7bef3959caf3360ca94b

but it's green now. So who knows.

According to https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-05-22&endday=2020-05-29&tree=all&bug=1572900 there haven't been any failures on windows10-aarch64 platforms in the last day or so, so I'm guessing there was some config change on the workers that fixed this problem. The try push at https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=f09ac7627fe5ed2851bbca75abd7b98309d02449 was based on the same base revision as my previous try pushes, and I did a new try push at https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=33fa9c1632dc82b3dbec301f76518af1d59dc6e3 with those patches rebased to master, and they're all green. So I'm pretty sure it wasn't something in m-c that fixed it. At any rate, I'm going to stop working on this and we should reland bug 1611660.

:aerickson, do you know who manages the windows10-aarch64 worker machines? Looks like they're the lenovo yoga laptops in bitbar. In particular, I'd like to know if there was any sort of config change on those machines in the last couple of days.

Flags: needinfo?(aerickson)

The timeout is still happening with Botond's patch applied, but on Win 7 MinGW debug reftest jobs. And it's timing out at a different place so I need to back up my investigation a few steps and go down this other codepath.

Some of the failures in this try push seem to indicate that the first "tab" content process that is created hangs on this ::_wfopen call. Not sure why. However in some of the logs it gets further, and I'll keep digging there.

(In reply to Timothy Nikkel (:tnikkel) from comment #69)

This should be a good next step of bisecting upon Kats' latest push

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec8cae04840a6d010d2718d8f7fbeb53b31c3a58

That might be adding printfs in the wrong place? It gets further than where one of the failures in kats' push was, but another failure on kats' push gets further still.

One more
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1115c8c86ec5a5089fc1d8b4fce45a2b09a517b

Not sure if I'm making any progress or going down the wrong path.

There are multiple failure points, it looks like. I'm going to start spinning out individual bugs for them so we can better track what's going on.

(In reply to Timothy Nikkel (:tnikkel) from comment #69)

This should be a good next step of bisecting upon Kats' latest push

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec8cae04840a6d010d2718d8f7fbeb53b31c3a58

This one has the right idea, but all the existing failures are in other places. I've triggered more jobs on this try push, hopefully we'll get one that isolates the hang in CrashReporter::SetRemoteExceptionHandler.

(In reply to Timothy Nikkel (:tnikkel) from comment #72)

One more
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1115c8c86ec5a5089fc1d8b4fce45a2b09a517b

Build failure.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #66)

:aerickson, do you know who manages the windows10-aarch64 worker machines? Looks like they're the lenovo yoga laptops in bitbar. In particular, I'd like to know if there was any sort of config change on those machines in the last couple of days.

:mcornmesser works on those. NI'ing.

Flags: needinfo?(aerickson) → needinfo?(mcornmesser)

:kats, there's a tool that tracks configuration changes to build & test infrastructure (https://changelog.dev.mozaws.net/). I don't see any changes to those in the past few days.

(In reply to Andrew Erickson [:aerickson] from comment #75)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #66)

:aerickson, do you know who manages the windows10-aarch64 worker machines? Looks like they're the lenovo yoga laptops in bitbar. In particular, I'd like to know if there was any sort of config change on those machines in the last couple of days.

:mcornmesser works on those. NI'ing.

There have been no recent config changes that I am aware of.

Flags: needinfo?(mcornmesser)
Attached file pernosco links

heycam noticed there were some linux failures that looked like they could be the same failure. So I asked pernosco and it sent me back these links. They might all be duplicates of the same thing; I always have trouble with the pernosco self-service page. So we'll have to look at these and see if they could be the same thing we are seeing on Windows.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #74)

(In reply to Timothy Nikkel (:tnikkel) from comment #69)

This should be a good next step of bisecting upon Kats' latest push

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec8cae04840a6d010d2718d8f7fbeb53b31c3a58

This one has the right idea, but all the existing failures are in other places. I've triggered more jobs on this try push, hopefully we'll get one that isolates the hang in CrashReporter::SetRemoteExceptionHandler.

Oh, it hangs in different places? Ugh.

(In reply to Timothy Nikkel (:tnikkel) from comment #72)

One more
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1115c8c86ec5a5089fc1d8b4fce45a2b09a517b

Build failure.

Fixed build failure
https://treeherder.mozilla.org/#/jobs?repo=try&revision=51757a73e28bc852d63e511a29bbb2ba53420f38

(In reply to Mark Cornmesser [:markco] from comment #77)

There have been no recent config changes that I am aware of.

Thanks. Kinda surprising, but so is everything about this bug.

I've filed bug 1642720 to disable the reftests on win7 mingw builds, per tom's suggestion at https://bugzilla.mozilla.org/show_bug.cgi?id=1611660#c44

(The remaining low-volume failure here is not a regression from bug 1611660, but a pre-existing issue.)

No longer regressed by: 1611660

Recent Windows AArch reftest perma-fails seem to be from bug 1653384.

Regressed by: 1653384

:mossop, do you know what's going on here (as author of regressing bug)?

Flags: needinfo?(kats) → needinfo?(dtownsend)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #92)

:mossop, do you know what's going on here (as author of regressing bug)?

I don't know; all that test does is add a string comparison. Are there any logs to look at? Is that test trying to disable e10s through the environment variable?

Flags: needinfo?(dtownsend)

Per discussion on #developers with :Aryx, it's not clear that's the regressing bug after all. Certainly from the patches it doesn't look like it should be at fault, since this job is e10s-enabled. We can bisect on autoland to confirm the regressor.

Retrigger results are inconclusive. This was intermittent at first, became more frequent, and is de facto perma-failing now.

In the last 7 days there have been 34 occurrences on windows10-64-ccov and windows10-aarch64 opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=313729807&repo=mozilla-central&lineNumber=4279

Kats, could you please direct this to someone who can take a look?

Flags: needinfo?(kats)
Whiteboard: [stockwell unknown] → [stockwell needswork:owner]

Last time I looked into this bug it disappeared on its own as I started doing printfs to narrow down where the startup hang was. I can try that again, will start tomorrow. Leaving needinfo on me for now.

Here's a new rebased try push: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=f880d3d5e737ad54a4864814ba0c60a308db2ae0

If this doesn't show the intermittents either, it may be because the extra logging I added is preventing the test from timing out. In that case the next step would be to examine the timestamps in the log and see if there is a 300+ second gap between process start and the first test running that is only interrupted by the logging I added. If so, startup is just taking longer than the currently set timeout, and we should either speed things up or increase the timeout. If not, there is likely still a startup hang and I'm just not hitting it with my try pushes.
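The timestamp check described above can be automated. A sketch that scans a raw log for consecutive `[task ...]` timestamps more than 300 seconds apart (the regex and threshold are assumptions based on the log format quoted in this bug, not part of the harness):

```python
import re
from datetime import datetime, timedelta

# Matches the "[task 2020-04-20T14:35:47.277Z]" prefix on each log line.
TS_RE = re.compile(r"\[task (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\]")

def find_gaps(lines, threshold=timedelta(seconds=300)):
    """Yield (gap, previous line, current line) for consecutive
    timestamped log lines more than `threshold` apart."""
    prev_ts = prev_line = None
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
        if prev_ts is not None and ts - prev_ts > threshold:
            yield ts - prev_ts, prev_line, line
        prev_ts, prev_line = ts, line
```

Running this over a failing log should flag exactly the silent stretch between "Application command" and "application timed out after 370 seconds".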

Flags: needinfo?(kats)

Ok, that try push got a hang, between this line:

nsWindowWatcher did OpenWindowInternal for chrome://browser/content/browser.xhtml

and these ones:

jar:file:///C:/tasks/task_1599502476/build/application/firefox/omni.ja!/chrome/toolkit/content/gfxsanity/sanityparent.html: creating document
jar:file:///C:/tasks/task_1599502476/build/application/firefox/browser/omni.ja!/chrome/browser/content/browser/browser.xhtml: creating document

(based on logging from successful runs). Will need to bisect further between those two.

There are 30 total failures in the last 7 days on windows10-aarch64 opt

recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=318228208&repo=mozilla-central&lineNumber=2743

[task 2020-10-09T23:55:42.956Z] 23:55:42 INFO - REFTEST TEST-START | dom/html/reftests/toblob-todataurl/todataurl-quality-default.html == dom/html/reftests/toblob-todataurl/quality-92-ref.html
[task 2020-10-09T23:55:42.956Z] 23:55:42 INFO - REFTEST TEST-LOAD | file:///C:/tasks/task_1602276908/build/tests/reftest/tests/dom/html/reftests/toblob-todataurl/todataurl-quality-default.html | 15 / 16 (93%)
[task 2020-10-09T23:55:42.993Z] 23:55:42 INFO - REFTEST TEST-PASS | dom/html/reftests/toblob-todataurl/todataurl-quality-default.html == dom/html/reftests/toblob-todataurl/quality-92-ref.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2020-10-09T23:55:42.994Z] 23:55:42 INFO - REFTEST TEST-END | dom/html/reftests/toblob-todataurl/todataurl-quality-default.html == dom/html/reftests/toblob-todataurl/quality-92-ref.html
[task 2020-10-09T23:55:43.005Z] 23:55:43 INFO - REFTEST INFO | Slowest test took 114ms (file:///C:/tasks/task_1602276908/build/tests/reftest/tests/dom/html/reftests/toblob-todataurl/toblob-quality-0.html)
[task 2020-10-09T23:55:43.006Z] 23:55:43 INFO - REFTEST INFO | Total canvas count = 7
[task 2020-10-09T23:55:43.218Z] 23:55:43 INFO - *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: recipe-client-addon-run
[task 2020-10-09T23:55:43.379Z] 23:55:43 INFO - 1602287743368 Marionette TRACE Received observer notification xpcom-will-shutdown
[task 2020-10-09T23:55:43.379Z] 23:55:43 INFO - 1602287743368 Marionette INFO Stopped listening on port 2828
[task 2020-10-09T23:55:43.379Z] 23:55:43 INFO - 1602287743368 Marionette DEBUG Marionette stopped listening
[task 2020-10-09T23:55:43.752Z] 23:55:43 INFO - REFTEST INFO | Process mode: e10s
[task 2020-10-09T23:55:43.752Z] 23:55:43 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
[task 2020-10-09T23:55:43.752Z] 23:55:43 INFO - REFTEST INFO | Running tests in file:///C:/tasks/task_1602276908/build/tests/reftest/tests/layout/reftests/svg/as-image/zoom/reftest.list
[task 2020-10-09T23:55:43.832Z] 23:55:43 INFO - REFTEST INFO | Running with e10s: True
[task 2020-10-09T23:55:43.832Z] 23:55:43 INFO - REFTEST INFO | Running with fission: False
[task 2020-10-09T23:55:43.832Z] 23:55:43 INFO - REFTEST INFO | Application command: C:\tasks\task_1602276908\build\application\firefox\firefox.exe -marionette --wait-for-browser -profile c:\users\testdr~1\appdata\local\temp\tmpil70cq.mozrunner
[task 2020-10-09T23:55:44.515Z] 23:55:44 INFO - 1602287744507 Marionette TRACE Marionette enabled
[task 2020-10-09T23:55:44.585Z] 23:55:44 INFO - 1602287744578 Marionette TRACE Received observer notification toplevel-window-ready
[task 2020-10-10T00:01:54.605Z] 00:01:54 ERROR - REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output
[task 2020-10-10T00:01:54.608Z] 00:01:54 ERROR - REFTEST ERROR | Force-terminating active process(es).
[task 2020-10-10T00:01:54.609Z] 00:01:54 INFO - REFTEST TEST-INFO | started process screenshot
[task 2020-10-10T00:01:54.900Z] 00:01:54 INFO - REFTEST TEST-INFO | screenshot: exit 0
[task 2020-10-10T00:01:54.920Z] 00:01:54 INFO - TEST-INFO | crashinject: exit 0
[task 2020-10-10T00:04:59.973Z] 00:04:59 INFO - WARNING | IO Completion Port failed to signal process shutdown
[task 2020-10-10T00:04:59.974Z] 00:04:59 INFO - Parent process 5836 exited with children alive:
[task 2020-10-10T00:04:59.975Z] 00:04:59 INFO - PIDS: 11700, 11548
[task 2020-10-10T00:04:59.976Z] 00:04:59 INFO - Attempting to kill them, but no guarantee of success
[task 2020-10-10T00:05:00.004Z] 00:05:00 ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 3221225477
[task 2020-10-10T00:05:00.064Z] 00:05:00 INFO - Traceback (most recent call last):
[task 2020-10-10T00:05:00.064Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 1053, in <module>
[task 2020-10-10T00:05:00.064Z] 00:05:00 INFO - sys.exit(run_test_harness(parser, options))
[task 2020-10-10T00:05:00.065Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 1045, in run_test_harness
[task 2020-10-10T00:05:00.067Z] 00:05:00 INFO - result = reftest.runTests(options.tests, options)
[task 2020-10-10T00:05:00.067Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 635, in runTests
[task 2020-10-10T00:05:00.072Z] 00:05:00 INFO - return self.runSerialTests(manifests, options, cmdargs)
[task 2020-10-10T00:05:00.072Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 993, in runSerialTests
[task 2020-10-10T00:05:00.077Z] 00:05:00 INFO - status = run(tests=tests)
[task 2020-10-10T00:05:00.077Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 967, in run
[task 2020-10-10T00:05:00.082Z] 00:05:00 INFO - **kwargs)
[task 2020-10-10T00:05:00.082Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 899, in runApp
[task 2020-10-10T00:05:00.084Z] 00:05:00 INFO - raise reraise(exc, value, tb)
[task 2020-10-10T00:05:00.084Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\tests\reftest\runreftest.py", line 862, in runApp
[task 2020-10-10T00:05:00.089Z] 00:05:00 INFO - marionette.start_session()
[task 2020-10-10T00:05:00.089Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
[task 2020-10-10T00:05:00.089Z] 00:05:00 INFO - m._handle_socket_failure()
[task 2020-10-10T00:05:00.090Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\venv\lib\site-packages\marionette_driver\marionette.py", line 654, in _handle_socket_failure
[task 2020-10-10T00:05:00.092Z] 00:05:00 INFO - reraise(exc_cls, exc, tb)
[task 2020-10-10T00:05:00.092Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
[task 2020-10-10T00:05:00.092Z] 00:05:00 INFO - return func(*args, **kwargs)
[task 2020-10-10T00:05:00.093Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1113, in start_session
[task 2020-10-10T00:05:00.098Z] 00:05:00 INFO - self.raise_for_port(timeout=timeout)
[task 2020-10-10T00:05:00.098Z] 00:05:00 INFO - File "C:\tasks\task_1602276908\build\venv\lib\site-packages\marionette_driver\marionette.py", line 573, in raise_for_port
[task 2020-10-10T00:05:00.100Z] 00:05:00 INFO - self.host, self.port))
[task 2020-10-10T00:05:00.100Z] 00:05:00 INFO - socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!
[task 2020-10-10T00:05:00.221Z] 00:05:00 ERROR - Return code: 1
[task 2020-10-10T00:05:00.224Z] 00:05:00 ERROR - No suite end message was emitted by this harness.
[task 2020-10-10T00:05:00.224Z] 00:05:00 INFO - TinderboxPrint: reftest-reftest<br/>354/0/0
[task 2020-10-10T00:05:00.224Z] 00:05:00 ERROR - # TBPL FAILURE #
[task 2020-10-10T00:05:00.225Z] 00:05:00 WARNING - setting return code to 2
[task 2020-10-10T00:05:00.225Z] 00:05:00 ERROR - The reftest suite: reftest ran with return status: FAILURE
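(The traceback above ends in marionette_driver's raise_for_port, which is essentially a polling loop on the Marionette TCP port (2828) that gives up at a deadline — so the socket.timeout is a symptom of the browser never finishing startup, not a network bug in the harness. A rough sketch of that kind of loop, with an illustrative name and signature rather than the real marionette_driver API:)

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, poll_interval=0.1):
    """Poll until something accepts a TCP connection on host:port,
    or raise socket.timeout once the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection refused/reset just means the server isn't up yet.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(poll_interval)
    raise socket.timeout(
        "Timed out waiting for connection on %s:%d!" % (host, port))
```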

Kartikaya, are there updates here?

Flags: needinfo?(kats)
Whiteboard: [stockwell unknown] → [stockwell needswork:owner]

Sorry I haven't had a chance to dig into this further, and probably won't in the coming week either. But the week after that I can try to make some progress here.

Flags: needinfo?(kats)
Summary: Intermittent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477 → Very Frequent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #113)

Sorry I haven't had a chance to dig into this further, and probably won't in the coming week either. But the week after that I can try to make some progress here.

Understood, thank you.

Whiteboard: [stockwell needswork:owner] → [stockwell needswork:owner][comment 113]

I did try to make a bit of progress on this bug this week but didn't get very far. As best I can tell so far, we get to the point where we try to open the window for browser.xhtml, but we never come out on the other side of the networking code. And the networking code is a big tangle of async communication, so it's pretty hard to trace through. I'll keep poking at it, though. Hopefully I can confirm it's getting lost in the networking code, and then maybe I can use the existing networking logging to track it down, or at least ask somebody from the networking team for help.

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=056286c8f9fdea8911becd165dd86efb94f0e8da
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=3307359b7feb2a5b950e6726fc4732fd3fa086a1

The last try push above didn't have any failures. Rebased and tried again: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=f95d0ce4c9797acc1f4ef70a8e3f53bd67e93599 - still no failures. Maybe I scared away the bug. The TH data also seems to indicate a reduction in the last couple of days.

In the last 7 days there have been 19 occurrences, all on windows10-aarch64 opt.

Recent failure: https://treeherder.mozilla.org/logviewer?job_id=321036306&repo=mozilla-central&lineNumber=38216

Summary: Very Frequent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477 → Intermittent REFTEST ERROR | Last test finished | application timed out after 370 seconds with no output | application terminated with exit code 3221225477
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 3 months ago27 days ago
Resolution: --- → INCOMPLETE