Closed Bug 1432037 Opened 6 years ago Closed 6 years ago

Intermittent talos timeout on startup, " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling

Categories

(Testing :: Talos, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=157610011&repo=mozilla-central

https://queue.taskcluster.net/v1/task/LtiB16UDQVClIIG1UG-ocQ/runs/0/artifacts/public/logs/live_backing.log

02:57:17     INFO -  PID 1676 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:57:18     INFO -  PID 1676 | 2018-01-21 02:57:18.146 plugin-container[1702:11423] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9d53, name = 'com.apple.tsm.portname'
02:57:18     INFO -  PID 1676 | See /usr/include/servers/bootstrap_defs.h for the error codes.
02:57:19     INFO -  PID 1676 | Open toolbox on 'console.openwithcache'
02:57:21     INFO -  PID 1676 | Garbage collect
02:57:23     INFO -  PID 1676 |
02:57:23     INFO -  PID 1676 | ###!!! [Parent][DispatchAsyncMessage] Error: PClientSource::Msg_ExecutionReady Route error: message sent to unknown actor ID
02:57:23     INFO -  PID 1676 |
02:57:23     INFO -  PID 1676 | Garbage collect
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel::SendAndWait] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150082,name=PBrowser::Msg_Destroy) Closed channel: cannot send/recv
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150028,name=PBrowser::Msg_EnableDisableCommands) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Closed channel: cannot send/recv
02:57:25     INFO -  PID 1676 |
02:57:25     INFO -  PID 1676 | Assertion failure
02:57:25     INFO -  PID 1676 | assert@chrome://browser/content/tabbrowser.xml:4697:46
02:57:25     INFO -  PID 1676 | finish@chrome://browser/content/tabbrowser.xml:4548:15
02:57:25     INFO -  PID 1676 | postActions@chrome://browser/content/tabbrowser.xml:4852:17
02:57:25     INFO -  PID 1676 | onUnloadTimeout@chrome://browser/content/tabbrowser.xml:4866:15
02:57:25     INFO -  PID 1676 | suppressDisplayPortAndQueueUnload/this.unloadTimer<@chrome://browser/content/tabbrowser.xml:5181:54
02:57:25     INFO -  PID 1676 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:57:25     INFO -  PID 1676 | 2018-01-21 02:57:25.965 plugin-container[1703:11479] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0xa04b, name = 'com.apple.tsm.portname'
02:57:25     INFO -  PID 1676 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:57:26     INFO - Automation Error: mozprocess timed out after 3600 seconds running ['/Users/cltbld/tasks/task_1516531822/build/venv/bin/python', '/Users/cltbld/tasks/task_1516531822/build/tests/talos/talos/run_tests.py', '--suite', 'g2-e10s', '--executablePath', '/Users/cltbld/tasks/task_1516531822/build/application/Firefox Nightly.app/Contents/MacOS/firefox', '--symbolsPath', 'https://queue.taskcluster.net/v1/task/ZDIj12-QQFq98lgOCFxmAg/artifacts/public/build/target.crashreporter-symbols.zip', '--title', 't-yosemite-r7-0216', '--geckoProfile', '--webServer', 'localhost', '--webServer', 'localhost', '--webServer', 'localhost', '--webServer', 'localhost', '--log-tbpl-level=debug', '--log-errorsummary=/Users/cltbld/tasks/task_1516531822/build/blobber_upload_dir/g2-e10s_errorsummary.log', '--log-raw=/Users/cltbld/tasks/task_1516531822/build/blobber_upload_dir/g2-e10s_raw.log']
03:57:26    ERROR - timed out after 3600 seconds of no output
03:57:26    ERROR - Return code: -15
03:57:26  WARNING - setting return code to -15
03:57:26    ERROR - # TBPL FAILURE #
03:57:26  WARNING - setting return code to 2
03:57:26     INFO - Running pre test command check_screen_resolution with 'bash -c screenresolution get && screenresolution list && system_profiler SPDisplaysDataType'
03:57:26     INFO - Running command: ('bash', '-c', 'screenresolution get && screenresolution list && system_profiler SPDisplaysDataType') in /Users/cltbld/tasks/task_1516531822/build
03:57:26     INFO - Copy/paste: bash -c "screenresolution get && screenresolution list && system_profiler SPDisplaysDataType"
03:57:26     INFO -  2018-01-21 03:57:26.029 screenresolution[1746:15073] starting screenresolution argv=screenresolution get
03:57:26     INFO -  2018-01-21 03:57:26.032 screenresolution[1746:15073] Display 0: 1600x1200x32@60
03:57:26     INFO -  2018-01-21 03:57:26.038 screenresolution[1747:15076] starting screenresolution argv=screenresolution list
03:57:26     INFO -  Available Modes on Display 0
03:57:26     INFO -    1600x1200x32@60      	1152x870x32@75       	1280x1024x32@75
03:57:26     INFO -    1024x768x32@75       	1024x768x32@70       	1024x768x32@60
03:57:26     INFO -    800x600x32@75        	800x600x32@60        	640x480x32@75
03:57:26     INFO -    640x480x32@60        	1280x1024x32@60      	1280x960x32@60
03:57:26     INFO -    1280x960x32@75       	1344x1008x32@60      Graphics/Displays:
03:57:26     INFO -      Intel Iris:
03:57:26     INFO -        Chipset Model: Intel Iris
03:57:26     INFO -        Type: GPU
03:57:26     INFO -        Bus: Built-In
03:57:26     INFO -        VRAM (Dynamic, Max): 1536 MB
03:57:26     INFO -        Vendor: Intel (0x8086)
03:57:26     INFO -        Device ID: 0x0a2e
03:57:26     INFO -        Revision ID: 0x0009
03:57:26     INFO -        Displays:
03:57:26     INFO -          Display:
03:57:26     INFO -            Resolution: 1600 x 1200 @ 60 Hz
03:57:26     INFO -            Pixel Depth: 32-Bit Color (ARGB8888)
03:57:26     INFO -            Display Serial Number: 1600x1200 60
03:57:26     INFO -            Main Display: Yes
03:57:26     INFO -            Mirror: Off
03:57:26     INFO -            Online: Yes
03:57:26     INFO -            Rotation: Supported
03:57:26     INFO - Return code: 0
03:57:26     INFO - Running post-action listener: _package_coverage_data
03:57:26     INFO - Running post-action listener: _resource_record_post_action
03:57:26     INFO - [mozharness: 2018-01-21 11:57:26.312895Z] Finished run-tests step (success)
03:57:26     INFO - Running post-run listener: _resource_record_post_run
03:57:26     INFO - Total resource usage - Wall time: 3767s; CPU: 43.0%; Read bytes: 283304960; Write bytes: 406849024; Read time: 18360; Write time: 1677
03:57:26     INFO - TinderboxPrint: CPU usage<br/>42.6%
03:57:26     INFO - TinderboxPrint: I/O read bytes / time<br/>283,304,960 / 18,360
03:57:26     INFO - TinderboxPrint: I/O write bytes / time<br/>406,849,024 / 1,677
03:57:26     INFO - TinderboxPrint: CPU idle<br/>8,647.5 (57.4%)
03:57:26     INFO - TinderboxPrint: CPU system<br/>3,581.0 (23.8%)
03:57:26     INFO - TinderboxPrint: CPU user<br/>2,838.9 (18.8%)
03:57:26     INFO - TinderboxPrint: Swap in / out<br/>286,097,408 / 0
03:57:26     INFO - install - Wall time: 23s; CPU: 53.0%; Read bytes: 240678912; Write bytes: 233997824; Read time: 17873; Write time: 495
03:57:26     INFO - setup-mitmproxy - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
03:57:26     INFO - run-tests - Wall time: 3744s; CPU: 43.0%; Read bytes: 38353920; Write bytes: 161767424; Read time: 240; Write time: 1154
03:57:28     INFO - Running post-run listener: _upload_blobber_files
03:57:28  WARNING - Blob upload gear skipped. Missing cmdline options.
03:57:28     INFO - Running post-run listener: copy_logs_to_upload_dir
03:57:28     INFO - Copying logs to upload dir...
03:57:28     INFO - mkdir: /Users/cltbld/tasks/task_1516531822/build/upload/logs
03:57:28     INFO - Copying logs to upload dir...
03:57:28  WARNING - returning nonzero exit status 2
[taskcluster 2018-01-21T11:57:28.512Z] Exit Code: 2
[taskcluster 2018-01-21T11:57:28.512Z] === Task Finished ===
rwood: we'll probably misstar the two like crazy, but this bug was an attempt to separate out the fact that your new baby, talos-with-profiling, mostly times out on g2. Perhaps that provides clues for bug 1427930, perhaps it's a completely separate thing, dunno.
Flags: needinfo?(rwood)
Summary: Intermittent nightly Automation Error: mozprocess timed out after 3600 seconds running ['/Users/cltbld/tasks/task_1516531822/build/venv/bin/python', '/Users/cltbld/tasks/task_1516531822/build/tests/talos/talos/run_tests.py', '--suite' → Intermittent talos timeout on startup, "command timed out : 3600 seconds without output , attempting to kill", " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling timeout on startup
Summary: Intermittent talos timeout on startup, "command timed out : 3600 seconds without output , attempting to kill", " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling timeout on startup → Intermittent talos timeout on startup, " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling
(In reply to Phil Ringnalda (:philor) from comment #1)
> rwood: we'll probably misstar the two like crazy, but this bug was an
> attempt to separate out the fact that your new baby, talos-with-profiling,
> mostly times out on g2. Perhaps that provides clues for bug 1427930, perhaps
> it's a completely separate thing, dunno.

Thanks Phil.

Some kind of assertion error during the first cycle and then the test hangs for an hour and times out:

02:57:25     INFO -  PID 1676 | Assertion failure
02:57:25     INFO -  PID 1676 | assert@chrome://browser/content/tabbrowser.xml:4697:46
02:57:25     INFO -  PID 1676 | finish@chrome://browser/content/tabbrowser.xml:4548:15
02:57:25     INFO -  PID 1676 | postActions@chrome://browser/content/tabbrowser.xml:4852:17
02:57:25     INFO -  PID 1676 | onUnloadTimeout@chrome://browser/content/tabbrowser.xml:4866:15
....
03:57:26    ERROR - timed out after 3600 seconds of no output

First I'll see if I can reproduce this locally.
Flags: needinfo?(rwood)
Can't reproduce it locally, however in the production logs it looks like a consistent problem with:

08:01:27     INFO -  PID 1926 | Open toolbox on 'console.openwithcache'

Pushing to try with that specific subtest disabled (but geckoProfile enabled).
(In reply to Robert Wood [:rwood] from comment #5)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=42eab9d5e863b844cebe96ab7eb0e4c705f9d884

All green - so it's the 'console.openwithcache' subtest that is the issue.

:bgrins, as the dev contact for DAMP, can you please have a look? There's a problem with the 'console.openwithcache' subtest on OSX (at least). The test hangs with the assertion failure noted in comment 3. This failure is very consistent when the test is run with geckoProfile enabled. Thanks!
Flags: needinfo?(bgrinstead)
I haven't figured it out yet, but I did a try push with some extra logging and where I bumped the number of messages it was waiting for down to 10 (from 100) and it's still orange: https://treeherder.mozilla.org/#/jobs?repo=try&revision=47105c1fa55907afbf79bf997299a27e87317b41&selectedJob=157813911. Oddly the info() calls and any reference to openwithcache aren't in the raw log - I'm not sure if this is an issue with the logging being dropped or if the function really isn't getting called for some reason.
Thanks for looking into this Brian. The only way I can get any logging to show up in my local console from inside damp.js _consoleOpenWithCachedMessagesTest() is to use:

await this.openToolboxAndLog("*RW* inside damp!");
Alex, do you have an idea on how to debug this issue that mostly occurs only when profiling? See Comments 6-8
Flags: needinfo?(bgrinstead) → needinfo?(poirot.alex)
I think I got similar failure when running it locally, there was some additional exception printed in the browser console:
NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIScriptSecurityManager.getLoadContextCodebasePrincipal]  remote-browser.xml:406

I'm wondering if that's just OOM because of profiler buffer. Here it really looks like a OOM and it is always hard to debug as it just crashes randomly and then result in exception at arbitrary locations.
Flags: needinfo?(poirot.alex)
It may be helpful to see if the exception disappear with a smaller buffer (--geckoProfileEntries)
I reduced the occurence of the local exception with this patch.
Is there a particular try syntax to reproduce these g2 profiling jobs?
Flags: needinfo?(rwood)
I reused bgrins try syntax from comment 7 and have been able to have 5 green runs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab88126c9ea648be84fd2c93ad50a8d5f3bf8a1f&selectedJob=161068999

I requested some more runs, but it looks good.
Flags: needinfo?(rwood)
Comment on attachment 8948674 [details]
Bug 1432037 - Wait for tab animation end when closing test tabs.

This patch is basically reusing the same technique used in tart over here:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/tart/addon/content/tart.js#45

It relates to BrowserCloseTabOrWindow call we both do in damp and tart.
TBH, I'm not sure BrowserCloseTabOrWindow is the best API to use to close a tab,
there may be a better API that is not racy like this one.
Attachment #8948674 - Flags: review?(bgrinstead)
Comment on attachment 8948674 [details]
Bug 1432037 - Wait for tab animation end when closing test tabs.

https://reviewboard.mozilla.org/r/218078/#review225280

Nice find! Works for me if it fixes the intermittent
Attachment #8948674 - Flags: review?(bgrinstead) → review+
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/62b6738695de
Wait for tab animation end when closing test tabs. r=bgrins
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(poirot.alex)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: