Closed Bug 1416125 Opened 2 years ago Closed Last year

Intermittent TEST-UNEXPECTED-FAIL | remoteautomation.py | application timed out after 670 seconds with no output

Categories

(Firefox for Android :: Testing, defect, P5)

defect

Tracking

()

RESOLVED FIXED
Firefox 62
Tracking Status
firefox62 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:backout])

Attachments

(1 file)

This has become frequent on Android today (seen at least on autoland, inbound and central, but starting around merges, so there is still the possibility that this is from code changes), see e.g. https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b39c21c2a7a4a6b3eceafcc78b645a27e104cd17&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel
Retriggers are green.

Geoff, can you take a look at this, please?

https://treeherder.mozilla.org/logviewer.html#?job_id=160589649&repo=autoland

[task 2018-02-06T12:59:33.715Z] 12:59:33     INFO -  REFTEST INFO | Running with e10s: False
[task 2018-02-06T12:59:43.755Z] 12:59:43     INFO -  INFO | automation.py | Application pid: 839
[task 2018-02-06T13:01:15.025Z] 13:01:15     INFO -  JavaScript error: resource://gre/modules/FileUtils.jsm, line 64: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIProperties.get]
[task 2018-02-06T13:01:22.664Z] 13:01:22     INFO -  REFTEST INFO | Reading manifest http://10.0.2.2:8888/../jsreftest/tests/jstests.list
[task 2018-02-06T13:01:22.664Z] 13:01:22     INFO -  REFTEST INFO | Dumping JSON representation of sandbox
[task 2018-02-06T13:01:22.665Z] 13:01:22     INFO -  REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"webrender":false,"layersOMTC":true,"advancedLayers":false,"layerChecksEnabled":true,"retainedDisplayList":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"retainedDisplayLists":false,"compareRetainedDisplayLists":false,"stylo":true,"styloVsGecko":false,"skiaPdf":false,"release_or_beta":false,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:60.0) Gecko/60.0 Firefox/60.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:60.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"gpuProcessForceEnabled":false,"prefs":{},"browserIsRemote":false,"asyncPan":true,"usesRepeatResampling":false}
[task 2018-02-06T13:13:34.313Z] 13:13:34     INFO -  Browser unexpectedly found running. Killing...
[task 2018-02-06T13:13:34.314Z] 13:13:34     INFO -  REFTEST TEST-INFO | started process screentopng
[task 2018-02-06T13:13:34.334Z] 13:13:34     INFO -  XScreenSaver state: Disabled
[task 2018-02-06T13:13:34.335Z] 13:13:34     INFO -  User input has been idle for 1878 seconds
[task 2018-02-06T13:13:34.670Z] 13:13:34     INFO -  REFTEST TEST-INFO | screentopng: exit 0
[task 2018-02-06T13:13:38.461Z] 13:13:38     INFO -  org.mozilla.fennec_aurora still alive after SIGABRT: waiting...
[task 2018-02-06T13:13:43.583Z] 13:13:43  WARNING -  TEST-UNEXPECTED-FAIL | remoteautomation.py | application timed out after 670 seconds with no output
Flags: needinfo?(gbrown)
I see frequent Android debug jsreftest timeouts on startup. Firefox is running and the reftest harness is active, but not running tests. That seems really unusual. I'll try to narrow this down with some retriggers...
Yeah, seems like it. It doesn't make any sense to me how this could be happening only for jsreftest and only on Android, but I guess the results don't lie. I'll see if I can find anything.
Flags: needinfo?(ahalberstadt)
Here's a log from before my push:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=156759327

And here's after:
https://treeherder.mozilla.org/logviewer.html#?job_id=160640120&repo=autoland

The logs around the failure look identical, so my best guess is that there's an existing race condition and my patch somehow affects the timing in the wrong way.
The screentopng thing is irrelevant. The real problem is the ~15 minute log silence after building the condition sandbox. So we're hanging somewhere in the manifest parsing code. If it is a pre-existing race condition, it would make sense for my patch to affect the timing, as I fairly heavily refactored that code.
Looks like this is a heisenbug. I added some debug statements to try and figure out what's happening, and everything is coming back green (normally the oranges would have happened well before the first green):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=329c3c18c1800655f304a148ab1be1470092c49a
See Also: → 1392391
Whiteboard: [stockwell needswork:owner] → [stockwell fixed:backout]
I don't know what the issue is here, as I can't reproduce locally and adding debug statements on try decreases the frequency back down to previous levels. I know it's bad, but I'm going to re-land bug 1392391 with the strategic debug statements in place (though I'll be sure they're also useful). I don't see other alternatives besides not re-landing bug 1392391.

For posterity, here are the try runs with/without the debug statements:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&tochange=31e1c34238f24cc96145b348b83d4e2f2cd5b7b0&fromchange=791c16ddd68e69e97b86375ac3df732891ce7106
Priority: -- → P5
There are 32 failures for the past 7 days, all of them occurring on android-4-3-armv7-api16 debug.

Log file: https://treeherder.mozilla.org/logviewer.html#?job_id=179086764&repo=autoland&lineNumber=1598

Log snippet:

[task 2018-05-17T23:07:37.792Z] 23:07:37     INFO -  INFO | automation.py | Launching: /builds/worker/workspace/build/hostutils/host-utils-61.0a1.en-US.linux-x86_64/xpcshell -g /builds/worker/workspace/build/hostutils/host-utils-61.0a1.en-US.linux-x86_64 -f /builds/worker/workspace/build/tests/modules/httpd.js -e "const _PROFILE_PATH = '/tmp/tmpwW2T7L';const _SERVER_PORT = '8888'; const _SERVER_ADDR ='10.0.2.2';" -f /builds/worker/workspace/build/tests/reftest/server.js
[task 2018-05-17T23:07:37.795Z] 23:07:37     INFO -  INFO | remotereftests.py | Server pid: 948
[task 2018-05-17T23:07:38.797Z] 23:07:38     INFO -  Device info:
[task 2018-05-17T23:07:39.622Z] 23:07:39     INFO -    uptime: 0 days 0 hours 3 minutes 7 seconds
[task 2018-05-17T23:07:39.622Z] 23:07:39     INFO -    battery: 50.0
[task 2018-05-17T23:07:39.622Z] 23:07:39     INFO -    process:
[task 2018-05-17T23:07:39.624Z] 23:07:39     INFO -       USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
[task 2018-05-17T23:07:39.624Z] 23:07:39     INFO -       root      1     0     364    268   c0098770 00017d58 S /init
[task 2018-05-17T23:07:39.624Z] 23:07:39     INFO -       root      2     0     0      0     c005048c 00000000 S kthreadd
[task 2018-05-17T23:07:39.624Z] 23:07:39     INFO -       root      3     2     0      0     c0042268 00000000 S ksoftirqd/0
[task 2018-05-17T23:07:39.626Z] 23:07:39     INFO -       root      4     2     0      0     c004ce30 00000000 S events/0
[task 2018-05-17T23:07:39.626Z] 23:07:39     INFO -       root      5     2     0      0     c004ce30 00000000 S khelper
[task 2018-05-17T23:07:39.627Z] 23:07:39     INFO -       root      6     2     0      0     c004ce30 00000000 S suspend
[task 2018-05-17T23:07:39.627Z] 23:07:39     INFO -       root      7     2     0      0     c004ce30 00000000 S kblockd/0
[task 2018-05-17T23:07:39.628Z] 23:07:39     INFO -       root      8     2     0      0     c004ce30 00000000 S cqueue
[task 2018-05-17T23:07:39.630Z] 23:07:39     INFO -       root      9     2     0      0     c016f7c4 00000000 S kseriod
[task 2018-05-17T23:07:39.630Z] 23:07:39     INFO -       root      10    2     0      0     c004ce30 00000000 S kmmcd
[task 2018-05-17T23:07:39.630Z] 23:07:39     INFO -       root      11    2     0      0     c006f36c 00000000 S pdflush
[task 2018-05-17T23:07:39.631Z] 23:07:39     INFO -       root      12    2     0      0     c006f36c 00000000 S pdflush
[task 2018-05-17T23:07:39.631Z] 23:07:39     INFO -       root      13    2     0      0     c007340c 00000000 S kswapd0
[task 2018-05-17T23:07:39.631Z] 23:07:39     INFO -       root      14    2     0      0     c004ce30 00000000 S aio/0
[task 2018-05-17T23:07:39.631Z] 23:07:39     INFO -       root      25    2     0      0     c016d0f8 00000000 S mtdblockd
[task 2018-05-17T23:07:39.632Z] 23:07:39     INFO -       root      26    2     0      0     c004ce30 00000000 S kstriped
[task 2018-05-17T23:07:39.632Z] 23:07:39     INFO -       root      27    2     0      0     c004ce30 00000000 S hid_compat
[task 2018-05-17T23:07:39.633Z] 23:07:39     INFO -       root      28    2     0      0     c004ce30 00000000 S rpciod/0
[task 2018-05-17T23:07:39.633Z] 23:07:39     INFO -       root      29    2     0      0     c0189ddc 00000000 S mmcqd
[task 2018-05-17T23:07:39.634Z] 23:07:39     INFO -       root      30    1     356    204   c0098770 00017d58 S /sbin/ueventd
[task 2018-05-17T23:07:39.635Z] 23:07:39     INFO -       system    31    1     852    348   c0195c08 400342d8 S /system/bin/servicemanager
[task 2018-05-17T23:07:39.635Z] 23:07:39     INFO -       root      32    1     3976   868   ffffffff 40049b90 S /system/bin/vold
[task 2018-05-17T23:07:39.636Z] 23:07:39     INFO -       root      34    1     8704   1256  ffffffff 40064b90 S /system/bin/netd
[task 2018-05-17T23:07:39.636Z] 23:07:39     INFO -       root      35    1     888    440   c01a10a0 40042f20 S /system/bin/debuggerd
[task 2018-05-17T23:07:39.637Z] 23:07:39     INFO -       radio     36    1     5464   884   ffffffff 40034b90 S /system/bin/rild
[task 2018-05-17T23:07:39.637Z] 23:07:39     INFO -       system    37    1     42388  2400  ffffffff 400772d8 S /system/bin/surfaceflinger
[task 2018-05-17T23:07:39.637Z] 23:07:39     INFO -       root      38    1     287112 34708 ffffffff 40042424 S zygote
[task 2018-05-17T23:07:39.638Z] 23:07:39     INFO -       drm       39    1     6796   2480  ffffffff 400d52d8 S /system/bin/drmserver
[task 2018-05-17T23:07:39.638Z] 23:07:39     INFO -       media     40    1     18036  5596  ffffffff 4009a2d8 S /system/bin/mediaserver
[task 2018-05-17T23:07:39.639Z] 23:07:39     INFO -       install   41    1     848    472   c021db90 40042014 S /system/bin/installd
[task 2018-05-17T23:07:39.639Z] 23:07:39     INFO -       keystore  42    1     3216   1224  c0195c08 400422d8 S /system/bin/keystore
[task 2018-05-17T23:07:39.640Z] 23:07:39     INFO -       root      43    1     784    376   c00b4eb0 400433dc S /system/bin/qemud
[task 2018-05-17T23:07:39.640Z] 23:07:39     INFO -       shell     46    1     776    460   c0148178 4002f014 S /system/bin/sh
[task 2018-05-17T23:07:39.641Z] 23:07:39     INFO -       root      47    1     5536   296   ffffffff 00016c7c S /sbin/adbd
[task 2018-05-17T23:07:39.641Z] 23:07:39     INFO -       system    278   38    367916 39832 ffffffff 400422d8 S system_server
[task 2018-05-17T23:07:39.641Z] 23:07:39     INFO -       u0_a15    364   38    301708 21092 ffffffff 400433dc S android.process.media
[task 2018-05-17T23:07:39.642Z] 23:07:39     INFO -       radio     395   38    318204 23808 ffffffff 400433dc S com.android.phone
[task 2018-05-17T23:07:39.642Z] 23:07:39     INFO -       u0_a23    406   38    314412 29520 ffffffff 400433dc S com.android.launcher
[task 2018-05-17T23:07:39.643Z] 23:07:39     INFO -       system    432   38    304148 18148 ffffffff 400433dc S com.android.settings
[task 2018-05-17T23:07:39.643Z] 23:07:39     INFO -       u0_a10    457   38    316104 25648 ffffffff 400433dc S android.process.acore
[task 2018-05-17T23:07:39.644Z] 23:07:39     INFO -       u0_a25    510   38    313760 48560 ffffffff 400433dc S com.android.systemui
[task 2018-05-17T23:07:39.644Z] 23:07:39     INFO -       u0_a32    525   38    302564 18952 ffffffff 400433dc S com.android.mms
[task 2018-05-17T23:07:39.645Z] 23:07:39     INFO -       u0_a20    560   38    299816 20692 ffffffff 400433dc S com.android.inputmethod.latin
[task 2018-05-17T23:07:39.645Z] 23:07:39     INFO -       root      598   47    744    420   c0098770 40034424 S logcat
[task 2018-05-17T23:07:39.646Z] 23:07:39     INFO -       u0_a19    623   38    296636 16172 ffffffff 400433dc S com.android.defcontainer
[task 2018-05-17T23:07:39.646Z] 23:07:39     INFO -       root      744   47    760    428   c002a7a0 4002fdb0 S /system/bin/sh
[task 2018-05-17T23:07:39.647Z] 23:07:39     INFO -       root      746   744   1100   460   00000000 40042014 R ps
[task 2018-05-17T23:07:39.647Z] 23:07:39     INFO -    systime: Thu May 17 16:07:39 PDT 2018
[task 2018-05-17T23:07:39.648Z] 23:07:39     INFO -    disk:
[task 2018-05-17T23:07:39.648Z] 23:07:39     INFO -       Filesystem               Size     Used     Free   Blksize
[task 2018-05-17T23:07:39.648Z] 23:07:39     INFO -       /data                  581.9M    54.5M   527.4M   4096
[task 2018-05-17T23:07:39.649Z] 23:07:39     INFO -       /system                214.4M   214.4M     0.0K   4096
[task 2018-05-17T23:07:39.649Z] 23:07:39     INFO -       /sdcard                597.6M    28.0K   597.6M   2048
[task 2018-05-17T23:07:39.650Z] 23:07:39     INFO -    os: sdk-eng 4.3.1 JLS36I eng.gbrown.20150308.182649 test-keys
[task 2018-05-17T23:07:39.650Z] 23:07:39     INFO -    id: emulator-5554
[task 2018-05-17T23:07:39.651Z] 23:07:39     INFO -  Test root: /sdcard/tests
[task 2018-05-17T23:07:40.190Z] 23:07:40     INFO -  chmod -R 777 /sdcard/tests/reftest/profile: Ignoring Error: args: adb -s emulator-5554 wait-for-device shell chmod -R 777 /sdcard/tests/reftest/profile; echo rc=$?, exitcode: 10, stdout: Unable to chmod /sdcard/tests/reftest/profile: No such file or directory
[task 2018-05-17T23:07:40.708Z] 23:07:40     INFO -  REFTEST INFO | Running with e10s: False
[task 2018-05-17T23:07:40.914Z] 23:07:40     INFO -  launch_application: am start -W -n org.mozilla.fennec_aurora/org.mozilla.gecko.BrowserApp -a android.intent.action.VIEW --es env9 MOZ_PROCESS_LOG=/tmp/tmpUMu7Cmpidlog --es env8 NO_EM_RESTART=1 --es args '-no-remote -profile /sdcard/tests/reftest/profile/' --es env3 MOZ_DISABLE_NONLOCAL_CONNECTIONS=1 --es env2 R_LOG_VERBOSE=1 --es env1 XPCOM_DEBUG_BREAK=stack --es env0 MOZ_CRASHREPORTER=1 --es env7 MOZ_CRASHREPORTER_NO_REPORT=1 --es env6 R_LOG_DESTINATION=stderr --es env5 MOZ_CRASHREPORTER_SHUTDOWN=1 --es env4 MOZ_IN_AUTOMATION=1 --es env10 R_LOG_LEVEL=6
[task 2018-05-17T23:07:54.658Z] 23:07:54     INFO -  INFO | automation.py | Application pid: 776
[task 2018-05-17T23:10:21.609Z] 23:10:21     INFO -  REFTEST INFO | Reading manifest http://10.0.2.2:8888/../jsreftest/tests/jstests.list
[task 2018-05-17T23:10:31.927Z] 23:10:31     INFO -  REFTEST INFO | Dumping JSON representation of sandbox
[task 2018-05-17T23:10:31.929Z] 23:10:31     INFO -  REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"webrender":false,"layersOMTC":true,"advancedLayers":false,"layerChecksEnabled":true,"retainedDisplayList":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"retainedDisplayLists":false,"compareRetainedDisplayLists":false,"stylo":true,"skiaPdf":false,"release_or_beta":false,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:62.0) Gecko/62.0 Firefox/62.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:62.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"gpuProcessForceEnabled":false,"prefs":{},"browserIsRemote":false,"asyncPan":true,"usesRepeatResampling":false,"verify":false}
[task 2018-05-17T23:23:16.113Z] 23:23:16     INFO -  Browser unexpectedly found running. Killing...
[task 2018-05-17T23:23:16.114Z] 23:23:16     INFO -  REFTEST TEST-INFO | started process screentopng
[task 2018-05-17T23:23:16.576Z] 23:23:16     INFO -  REFTEST TEST-INFO | screentopng: exit 0
[task 2018-05-17T23:23:31.526Z] 23:23:31  WARNING -  TEST-UNEXPECTED-FAIL | remoteautomation.py | application timed out after 670 seconds with no output
Flags: needinfo?(gbrown)
We looked at this in February - comments 12 through 19 - but could not resolve it. Failures have continued fairly steadily. I think the recent spike was intermittent - probably nothing to worry about. I'd be happy to see this resolved, but not sure it's worth really digging into. Pinging :ahal just in case he wants to revisit.
Flags: needinfo?(gbrown) → needinfo?(ahal)
Duplicate of this bug: 1465721
There have been 42 failures in the last 7 days. All the failures occur on android-4-3-armv7-api16 / debug.

Recent log file:
https://treeherder.mozilla.org/logviewer.html#?job_id=182442192&repo=mozilla-inbound&lineNumber=1251

Snippet with the failure:

[task 2018-06-08T01:32:20.234Z] 01:32:20     INFO -  INFO | automation.py | Application pid: 772
[task 2018-06-08T01:34:12.148Z] 01:34:12     INFO -  REFTEST INFO | Reading manifest http://10.0.2.2:8888/../jsreftest/tests/jstests.list
[task 2018-06-08T01:34:12.148Z] 01:34:12     INFO -  REFTEST INFO | Dumping JSON representation of sandbox
[task 2018-06-08T01:34:12.150Z] 01:34:12     INFO -  REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"webrender":false,"layersOMTC":true,"advancedLayers":false,"layerChecksEnabled":true,"retainedDisplayList":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"retainedDisplayLists":false,"compareRetainedDisplayLists":false,"skiaPdf":false,"release_or_beta":false,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:62.0) Gecko/62.0 Firefox/62.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:62.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"gpuProcessForceEnabled":false,"prefs":{},"browserIsRemote":false,"asyncPan":true,"usesRepeatResampling":false,"verify":false}
[task 2018-06-08T01:46:38.954Z] 01:46:38     INFO -  Browser unexpectedly found running. Killing...
[task 2018-06-08T01:46:38.954Z] 01:46:38     INFO -  REFTEST TEST-INFO | started process screentopng
[task 2018-06-08T01:46:39.311Z] 01:46:39     INFO -  REFTEST TEST-INFO | screentopng: exit 0
[task 2018-06-08T01:46:54.365Z] 01:46:54  WARNING -  TEST-UNEXPECTED-FAIL | remoteautomation.py | application timed out after 670 seconds with no output
[task 2018-06-08T01:46:54.367Z] 01:46:54     INFO -  INFO | automation.py | Application ran for: 0:14:44.168232
[task 2018-06-08T01:46:54.367Z] 01:46:54     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpnN7xjtpidlog
Flags: needinfo?(gbrown)
Maybe there is nothing tricky happening here...

Almost all failures reported here are Android/debug jsreftest. The error message means there was no output seen in the test log for 670 seconds -- about 11 minutes. Look at any successful Android/debug jsreftest test log and you'll see something like this:

[task 2018-06-08T10:38:00.021Z] 10:38:00     INFO -  REFTEST INFO | Reading manifest http://10.0.2.2:8888/../jsreftest/tests/jstests.list
[task 2018-06-08T10:38:10.440Z] 10:38:10     INFO -  REFTEST INFO | Dumping JSON representation of sandbox
[task 2018-06-08T10:38:10.441Z] 10:38:10     INFO -  REFTEST INFO | {"isDebugBuild":true,"xulRuntime":{"widgetToolkit":"android","OS":"Android","XPCOMABI":"arm-eabi-gcc3"},"smallScreen":false,"d2d":false,"dwrite":false,"gpuProcess":false,"azureCairo":false,"azureSkia":true,"skiaContent":true,"azureSkiaGL":1,"contentSameGfxBackendAsCanvas":true,"layersGPUAccelerated":true,"d3d11":false,"d3d9":false,"layersOpenGL":true,"webrender":false,"layersOMTC":true,"advancedLayers":false,"layerChecksEnabled":true,"retainedDisplayList":false,"Android":true,"cocoaWidget":false,"gtkWidget":false,"qtWidget":false,"winWidget":false,"transparentScrollbars":false,"AndroidVersion":18,"AddressSanitizer":false,"webrtc":true,"retainedDisplayLists":false,"compareRetainedDisplayLists":false,"skiaPdf":false,"release_or_beta":false,"http":{"userAgent":"Mozilla/5.0 (Android 4.3.1; Mobile; rv:62.0) Gecko/62.0 Firefox/62.0","appName":"Mozilla","appVersion":"5.0","platform":"Android 4.3.1","oscpu":"Linux armv7l","misc":"rv:62.0"},"haveTestPlugin":false,"windowsDefaultTheme":false,"nativeThemePref":false,"gpuProcessForceEnabled":false,"prefs":{},"browserIsRemote":false,"asyncPan":true,"usesRepeatResampling":false,"verify":false}
[task 2018-06-08T10:50:44.810Z] 10:50:44     INFO -  REFTEST INFO | Running chunk 1 out of 100 chunks.  tests 1-338/33643

...there is a 10+ minute pause between logging the sandbox and the start of test runs. I suppose we are reading the test manifests during this time. Does it really take 10+ minutes to read manifests? Yes, it's Android, and it is Debug, but 10+ minutes?! Will try to verify a bit more deeply, but maybe this is just a very slow operation at approximately the no output timeout boundary.
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Flags: needinfo?(ahal)
Finally, this bug makes sense! Manifest parsing for Android/debug jsreftests routinely requires over 10 minutes, all without logging; if it goes over 670 seconds, we generate another intermittent failure here. 

The logging :ahal added earlier probably helped for a while by logging at the start and end of the "meat" of parsing. This patch logs throughout the manifest parsing process, once every 5000 parsed tests -- about once every 90 seconds for Android/Debug jsreftests.

I've used logger.debug(), which does not show up in normal test logs. These records appear in the raw log on the device, so cancel the no-output timer on the host, even though they are later filtered and not displayed.

On try, I see no no-output timeouts with this extra logging:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=78dcb91c458bdb75be3c82fcbf412c499d3860d2
Attachment #8985105 - Flags: review?(ahal)
Comment on attachment 8985105 [details] [diff] [review]
add periodic logging during manifest parsing to avoid no-output timeout

Review of attachment 8985105 [details] [diff] [review]:
-----------------------------------------------------------------

Great catch, thanks for finding this! I wonder why this wasn't an issue before my refactor though. Maybe I removed a log statement or something.

::: layout/tools/reftest/manifest.jsm
@@ +687,5 @@
> +    // No-output timeouts during manifest parsing have been a problem for
> +    // jsreftests on Android/debug. Any logging resets the no-output timer,
> +    // even debug logging which is normally not displayed.
> +    if ((g.urls.length % 5000) == 0)
> +        g.logger.debug(g.urls.length + " tests found...");

Could we log the name of the manifest being parsed at the top of the 'ReadManifest' function instead (still at the debug level)? Or will that be too verbose. I think doing that might be more useful.
Attachment #8985105 - Flags: review?(ahal) → review+
(In reply to Andrew Halberstadt [:ahal] from comment #44)
> Great catch, thanks for finding this! I wonder why this wasn't an issue
> before my refactor though. Maybe I removed a log statement or something.

Yes, I think either a change in logging during manifest parsing, or your change made manifest parsing run a little slower, at least on Android/debug.

> Could we log the name of the manifest being parsed at the top of the
> 'ReadManifest' function instead (still at the debug level)? Or will that be
> too verbose. I think doing that might be more useful.

I had a look at that. There can be hundreds of ReadManifest calls, whereas the 5000 tests logging is only hit a half dozen times typically: I'd prefer to keep this tweak as light-weight as possible, so I prefer the 5000 tests logging.
Some of these failures have been reported in other bugs, like...
Blocks: 1426822
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a75cbfcc1a0e
Prevent intermittent timeouts during reftest manifest parsing; r=ahal
https://hg.mozilla.org/mozilla-central/rev/a75cbfcc1a0e
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
See Also: → 1485222
You need to log in before you can comment on or make changes to this bug.