Closed Bug 1418575 Opened 7 years ago Closed 6 years ago

Intermittent reftest | application timed out after 370 seconds with no output

Categories

(Testing :: Reftest, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

It looks like most of the recent failures here are on startup, similar in appearance to bug 1414495 (but with a slightly different title, for reftest). Bug 1414495 is currently waiting on a needinfo.
Depends on: 1414495
Whiteboard: [stockwell needswork]
There are 40 failures in the past 7 days, most of them on windows10-64 and windows10-64-qr (one on windows10-64-ccov), on both opt and debug.
Recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153572745
And a relevant part of the log:
 nsStringStats
23:10:28     INFO -   => mAllocCount:          14287
23:10:28     INFO -   => mReallocCount:          112
23:10:28     INFO -   => mFreeCount:           14286  --  LEAKED 1 !!!
23:10:28     INFO -   => mShareCount:           7533
23:10:28     INFO -   => mAdoptCount:           1628
23:10:28     INFO -   => mAdoptFreeCount:       1628
23:10:28     INFO -   => Process ID: 9236, Thread ID: 9240
23:10:28     INFO -  DLL blocklist was unable to intercept AppInit DLLs.
23:10:28     INFO -  ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\runreftest_leaks_tab_pid9784.log
23:10:28     INFO -  [Child 9784, Main Thread] WARNING: No CID found when attempting to map contract ID: file z:/build/build/src/xpcom/components/nsComponentManager.cpp, line 508
23:10:30     INFO -  --DOCSHELL 0000023E726A2800 == 7 [pid = 9140] [id = {1629306b-b100-4255-b0ba-f785b6dd951f}]
23:10:30     INFO -  --DOCSHELL 0000023E76C4C000 == 6 [pid = 9140] [id = {f8870708-5c93-4fc3-ab4b-91d9401353cf}]
23:10:30     INFO -  --DOCSHELL 0000023E75116000 == 5 [pid = 9140] [id = {13e6f8e3-4414-4009-8cb0-12d8d89cb225}]
23:10:30     INFO -  --DOCSHELL 0000023E74BE8000 == 4 [pid = 9140] [id = {3e18e2e6-aba4-4e2c-86fc-f5c81c3d4e96}]
23:10:31     INFO -  --DOMWINDOW == 18 (0000023E71AC83F0) [pid = 9140] [serial = 8] [outer = 0000000000000000] [url = about:blank]
23:10:33     INFO -  --DOCSHELL 000002A8A2340000 == 1 [pid = 8652] [id = {041af833-57cf-4884-b033-bfeaac86faef}]
23:10:35     INFO -  --DOMWINDOW == 17 (0000023E71E6CC00) [pid = 9140] [serial = 4] [outer = 0000000000000000] [url = about:blank]
23:10:35     INFO -  --DOMWINDOW == 16 (0000023E72071800) [pid = 9140] [serial = 13] [outer = 0000000000000000] [url = about:blank]
23:10:35     INFO -  --DOMWINDOW == 15 (0000023E72073800) [pid = 9140] [serial = 14] [outer = 0000000000000000] [url = about:blank]
23:10:35     INFO -  --DOMWINDOW == 14 (0000023E78041800) [pid = 9140] [serial = 11] [outer = 0000000000000000] [url = about:blank]
23:10:37     INFO -  --DOMWINDOW == 5 (000002A8AC51D400) [pid = 8652] [serial = 5] [outer = 0000000000000000] [url = about:blank]
23:10:37     INFO -  --DOMWINDOW == 4 (000002A8A6992800) [pid = 8652] [serial = 2] [outer = 0000000000000000] [url = about:blank]
23:10:38     INFO -  --DOMWINDOW == 13 (0000023E71ACA4E0) [pid = 9140] [serial = 9] [outer = 0000000000000000] [url = about:blank]
23:10:40     INFO -  [DEBUG SHUTDOWN] Shutdown: decoder=000002A8AC7CB000 state machine=000002A8AC7FE000
23:10:40     INFO -  --DOMWINDOW == 3 (000002A8A67B6470) [pid = 8652] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
23:10:40     INFO -  [DEBUG SHUTDOWN] Enter: state machine=000002A8A5A20980 reader=000002A8A69EB670
23:10:40     INFO -  [DEBUG SHUTDOWN] Shutdown: reader=000002A8AC7FB000 shutdown demuxer=000002A8A5A20280
23:10:40     INFO -  [DEBUG SHUTDOWN] Shutdown: pool=000002A8AC5EFB80 count=1
23:10:40     INFO -  [DEBUG SHUTDOWN] operator (): pool=000002A8AC5EFB80 shutdown=true count=0
23:10:40     INFO -  [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down audio task queue
23:10:40     INFO -  [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down video task queue
23:10:40     INFO -  [DEBUG SHUTDOWN] FinishShutdown: state machine=000002A8AC7FE000
23:10:40     INFO -  [DEBUG SHUTDOWN] Unregister: decoder=000002A8AC7CB000, count=0
23:10:41     INFO -  [Parent 9140, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/dom/fetch/FetchDriver.cpp, line 696
23:10:41     INFO -  --DOMWINDOW == 12 (0000023E76AB0C00) [pid = 9140] [serial = 10] [outer = 0000000000000000] [url = about:blank]
23:10:43     INFO -  --DOMWINDOW == 2 (000002A8AC572400) [pid = 8652] [serial = 3] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
23:10:43     INFO -  --DOMWINDOW == 2 (0000016D269F6C00) [pid = 9440] [serial = 2] [outer = 0000000000000000] [url = about:blank]
23:16:53     INFO -  REFTEST ERROR | reftest | application timed out after 370 seconds with no output
23:16:53     INFO -  REFTEST ERROR | Force-terminating active process(es).
23:16:53     INFO -  REFTEST TEST-INFO | started process screenshot
23:16:53     INFO -  REFTEST TEST-INFO | screenshot: exit 0
23:16:53     INFO -  TEST-INFO | crashinject: exit 0
23:16:54  WARNING -  TEST-UNEXPECTED-FAIL | reftest | application terminated with exit code 1
23:16:54     INFO -  REFTEST INFO | Copy/paste: Z:\task_1514588294\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp Z:\task_1514588294\build\symbols
23:17:06     INFO -  REFTEST INFO | Saved minidump as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp
23:17:06     INFO -  REFTEST INFO | Saved app info as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.extra
23:17:06    ERROR -  REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)]
23:17:06     INFO -  Crash dump filename: c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp
23:17:06     INFO -  Operating system: Windows NT
23:17:06     INFO -                    10.0.15063
23:17:06     INFO -  CPU: amd64
23:17:06     INFO -       family 6 model 45 stepping 7
23:17:06     INFO -       8 CPUs
23:17:06     INFO -  GPU: UNKNOWN
23:17:06     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_WRITE
23:17:06     INFO -  Crash address: 0x0
23:17:06     INFO -  Process uptime: 395 seconds
23:17:06     INFO -  Thread 72 (crashed)

:gbrown Any updates here?
Flags: needinfo?(gbrown)
Depends on: 1427946
(In reply to Noemi Erli[:noemi_erli] from comment #10)
> :gbrown Any updates here?

I had been waiting for progress on bug 1414495, as indicated in comment 6.

However, I noticed that many recent failures are in Windows 10 crashtests and most of those have similar screenshots, which I have not seen before; I filed bug 1427946 to try to understand those screenshots better.
Flags: needinfo?(gbrown)
I think the fix for bug 1427946 has helped a lot: I no longer see those TeamViewer screenshots, and there are fewer failures reported here in the last 24 hours or so.

Remaining failures might require progress on bug 1414495.
Most of these failures are in QR builds - let's see if :kats might have ideas.
Flags: needinfo?(bugmail)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
All the QR failures I looked at seem to have a configmymonitor.exe startup failure; see this example screenshot: https://public-artifacts.taskcluster.net/SzkAyxr8ScSrcK4jNablXg/0/public/test_info/mozilla-test-fail-screenshot_ricvh2.png

So this seems to point to a configuration issue similar to the TeamViewer thing. Also, according to OrangeFactor, the last one of these that happened on a QR build was Jan 5, so maybe it's no longer an issue.
Flags: needinfo?(bugmail)
There have been 41 failures in the last 7 days and 168 failures in the last 21 days.
For the failure pattern, see comment 10.

Recent failure log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157051190&lineNumber=1733

I checked a few logs and noticed that the reftest analyzer is not loading properly.
Example: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/ID5xSavmQmyp1-hqAkayvw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1

:jmaher there are more than 150 failures in the last 21 days, but the failure rate went down in the last week. Should we still consider disabling this test? 

Thank you!
Flags: needinfo?(jmaher)
What is interesting here is that, out of 36 failures in the last week, only 3 are on hardware (i.e. win10 reftest) and the rest are on VMs (i.e. taskcluster); also, 2 of the failures are not win10 (1x linux64, 1x win7).

Given that reftests run at the same rate as crashtests or jsreftests, I would conclude that any focus here should be on the VMs, not on reftests running on hardware.

I looked at some logs and saw some differences between the hardware and the VMs:
hardware:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157030919&lineNumber=1949

vm:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=156698165&lineNumber=1675


On hardware we do not see statements like:
[DEBUG SHUTDOWN] Shutdown: decoder=00000227C5E85800 state machine=00000227CBA96000

The statement above is from the VM log mentioned above, which is an opt build. So either we are not hitting that initialization path on hardware, or there is something different in the OS/environment that is causing the difference.

The screenshot from the vm in the above log is:
https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/test_info/mozilla-test-fail-screenshot_fdtd7g.png

There is a reference to a failure for configmymonitor.exe, which cannot start; not sure if this is related, but we should fix that.


A few differences in the environment variables:
1) live logging ports specified for VM/Taskcluster
2) KTS/monitoring for Hardware/Buildbot
3) taskcluster is missing these MOZ_ env variables:
'MOZ_CRASHREPORTER_NO_REPORT': '1',
'MOZ_HIDE_RESULTS_TABLE': '1',
'MOZ_NO_REMOTE': '1',
'NO_EM_RESTART': '1',
'NO_FAIL_ON_TEST_ERRORS': '1',
'XPCOM_DEBUG_BREAK': 'warn'

^ NOTE: I have a webrender link above which has different env vars, but all VM logs are missing these env vars.


I cannot find a spot in the scripts where we set these environment variables. Many harnesses set them, but it seems as though these are typically set up at the OS level and inherited prior to running.

:grenade, do you know how to determine what the default environment variables are on Windows 10, on the IX hardware vs. the VM environment?
Flags: needinfo?(jmaher) → needinfo?(rthijssen)
I would just use task creator [1] to run a task with a single command: `set` [2] - once on a hw worker type and once on a vm. That should give a list of the vars set for the user that runs the tests. The default vars differ for the admin user, so it's important to list the vars set for the user that actually runs tasks. generic-worker creates that user and also sets a number of environment and profile configurations for it, so it's more relevant to get the env settings at the beginning of a task run than at machine configuration time, which would be quite a different thing.

[1]: https://tools.taskcluster.net/task-creator/
[2]: https://ss64.com/nt/set.html
Flags: needinfo?(rthijssen)
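For illustration, here is a minimal sketch (as a Python dict) of the kind of one-off env-dump task described above; the provisionerId, workerType, timestamps, and owner below are placeholder assumptions, and the payload layout should be checked against the generic-worker documentation before use.

# Hypothetical sketch only: provisioner, worker type, and owner are placeholders.
# Run it once against the vm worker type and once against the hw worker type,
# then diff the output of `set` from the two task logs.
env_dump_task = {
    "provisionerId": "aws-provisioner-v1",
    "workerType": "gecko-t-win10-64",    # swap for the hardware worker type on the second run
    "created": "<ISO timestamp>",        # fill in when creating the task
    "deadline": "<ISO timestamp + 1h>",
    "payload": {
        "command": ["set"],              # dumps the env vars of the task user
        "maxRunTime": 600,
    },
    "metadata": {
        "name": "env dump",
        "description": "List environment variables for the user that runs tests",
        "owner": "someone@example.com",
        "source": "https://bugzilla.mozilla.org/show_bug.cgi?id=1418575",
    },
}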
OK, I did the edit task and added some env vars; they seem to pass through to the test harness. It looks like I could set them here, in the docker image creation area:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/docker_image.py#148

In addition, I see other places where we set similar variables:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py#251


Likewise, possibly we could ensure the harness sets these properly (we would need to do this for ALL harnesses to be more complete).
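As a rough illustration of the transform-level option (this is not the actual code in docker_image.py or mozharness_test.py; the task['worker']['env'] layout is an assumption), injecting the missing defaults could look something like this:

# Hedged sketch of a taskgraph-transform-style helper; the dict layout is assumed.
DEFAULT_TEST_ENV = {
    'MOZ_CRASHREPORTER_NO_REPORT': '1',
    'MOZ_HIDE_RESULTS_TABLE': '1',
    'MOZ_NO_REMOTE': '1',
    'NO_EM_RESTART': '1',
    'NO_FAIL_ON_TEST_ERRORS': '1',
    'XPCOM_DEBUG_BREAK': 'warn',
}

def add_default_test_env(config, tasks):
    for task in tasks:
        env = task.setdefault('worker', {}).setdefault('env', {})
        for key, value in DEFAULT_TEST_ENV.items():
            env.setdefault(key, value)  # do not clobber values already set elsewhere
        yield task

Doing the same thing in each harness instead would have to be repeated per harness, as noted above.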
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22)
> The screenshot from the vm in the above log is:
> https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/
> test_info/mozilla-test-fail-screenshot_fdtd7g.png
> 
> there is a reference to a failure for configmymonitor.exe which cannot
> start- not sure if this is related or not, but we should fix that.

thanks for spotting that. i've now removed configmymonitor and fakemon from these instances as it doesn't work on windows 10 thus far.
https://github.com/mozilla-releng/OpenCloudConfig/commit/a3453c1e0cb373a381a53d0f3b3d733668cad41d
here is a case where I have added environment variables:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c21026ce655b91ecbf061363c84ba268f27f5709&selectedJob=157409919

looking at the jsreftest log, I see we time out, then start up again and run all the tests successfully:
https://public-artifacts.taskcluster.net/Qu13jFheQ86LgnlvXvBAKA/0/public/logs/live_backing.log

same for the crashtest failure:
https://public-artifacts.taskcluster.net/LrrI6B4tSA-dNu_VSyLA6A/0/public/logs/live_backing.log

I noticed that the first time we launch the browser, I see this:
01:10:33     INFO -  REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -profile c:\users\genericworker\appdata\local\temp\tmpufuw0c.mozrunner

then the second time, I see this:
01:17:09     INFO -  REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -marionette -profile c:\users\genericworker\appdata\local\temp\tmpdtfhav.mozrunner


why do we have two -marionette args- there must be a loop somewhere.  This also proves the machine is valid and workable- we just didn't start the test after launching the browser.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #26)
> why do we have two -marionette args- there must be a loop somewhere.  This
> also proves the machine is valid and workable- we just didn't start the test
> after launching the browser.

I think we run tests until success, to allow for resume-after-crash:

https://dxr.mozilla.org/mozilla-central/rev/5faab9e619901b1513fd4ca137747231be550def/layout/tools/reftest/runreftest.py#787

"-marionette" will be added every time, but I don't think that makes a difference to anything.
Depends on: 1434123
I don't know what is going wrong here.

We recently added marionette trace logging to reftests and we have some failure logs with that additional logging now -- but the marionette trace logging looks normal to me.

Logs suggest to me that Firefox starts up normally, but then hangs. On the other hand, screenshots for these failures do not show Firefox displayed - odd.
(In reply to Rob Thijssen (:grenade UTC+2) from comment #25)
> (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22)
> > The screenshot from the vm in the above log is:
> > https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/
> > test_info/mozilla-test-fail-screenshot_fdtd7g.png
> > 
> > there is a reference to a failure for configmymonitor.exe which cannot
> > start- not sure if this is related or not, but we should fix that.
> 
> thanks for spotting that. i've now removed configmymonitor and fakemon from
> these instances as it doesn't work on windows 10 thus far.
> https://github.com/mozilla-releng/OpenCloudConfig/commit/
> a3453c1e0cb373a381a53d0f3b3d733668cad41d

:grenade -- Some recent screenshots still show configmymonitor errors.

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=158892135&lineNumber=1810

https://public-artifacts.taskcluster.net/DePvXD_2STKi6w8mLvRoFw/1/public/test_info/mozilla-test-fail-screenshot_rpfod_.png
Flags: needinfo?(rthijssen)
The removal patch went in but was reverted due to an unrelated problem. An AMI redeploy should fix it; in progress now:
https://tools.taskcluster.net/groups/BWLMji0_Qmek1p3KCl8HqQ/tasks/M10X4t_vRtCW7BYmRLypAQ/details
Flags: needinfo?(rthijssen)
No new failures since Feb 8! (I don't know why. Maybe comment 32...?)
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Geoff, I think we can close this bug?
Flags: needinfo?(gbrown)
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
See Also: → 1441580