Closed
Bug 1432037
Opened 6 years ago
Closed 6 years ago
Intermittent talos timeout on startup, " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling
Categories
(Testing :: Talos, defect, P5)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: intermittent-bug-filer, Unassigned)
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
Filed by: apavel [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=157610011&repo=mozilla-central https://queue.taskcluster.net/v1/task/LtiB16UDQVClIIG1UG-ocQ/runs/0/artifacts/public/logs/live_backing.log 02:57:17 INFO - PID 1676 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath 02:57:18 INFO - PID 1676 | 2018-01-21 02:57:18.146 plugin-container[1702:11423] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9d53, name = 'com.apple.tsm.portname' 02:57:18 INFO - PID 1676 | See /usr/include/servers/bootstrap_defs.h for the error codes. 02:57:19 INFO - PID 1676 | Open toolbox on 'console.openwithcache' 02:57:21 INFO - PID 1676 | Garbage collect 02:57:23 INFO - PID 1676 | 02:57:23 INFO - PID 1676 | ###!!! [Parent][DispatchAsyncMessage] Error: PClientSource::Msg_ExecutionReady Route error: message sent to unknown actor ID 02:57:23 INFO - PID 1676 | 02:57:23 INFO - PID 1676 | Garbage collect 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel::SendAndWait] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Parent][MessageChannel] Error: (msgtype=0x150082,name=PBrowser::Msg_Destroy) Closed channel: cannot send/recv 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150028,name=PBrowser::Msg_EnableDisableCommands) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x150001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | ###!!! [Child][MessageChannel] Error: (msgtype=0x15002B,name=PBrowser::Msg_SetInputContext) Closed channel: cannot send/recv 02:57:25 INFO - PID 1676 | 02:57:25 INFO - PID 1676 | Assertion failure 02:57:25 INFO - PID 1676 | assert@chrome://browser/content/tabbrowser.xml:4697:46 02:57:25 INFO - PID 1676 | finish@chrome://browser/content/tabbrowser.xml:4548:15 02:57:25 INFO - PID 1676 | postActions@chrome://browser/content/tabbrowser.xml:4852:17 02:57:25 INFO - PID 1676 | onUnloadTimeout@chrome://browser/content/tabbrowser.xml:4866:15 02:57:25 INFO - PID 1676 | suppressDisplayPortAndQueueUnload/this.unloadTimer<@chrome://browser/content/tabbrowser.xml:5181:54 02:57:25 INFO - PID 1676 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath 02:57:25 INFO - PID 1676 | 2018-01-21 02:57:25.965 plugin-container[1703:11479] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0xa04b, name = 'com.apple.tsm.portname' 02:57:25 INFO - PID 1676 | See /usr/include/servers/bootstrap_defs.h for the error codes. 03:57:26 INFO - Automation Error: mozprocess timed out after 3600 seconds running ['/Users/cltbld/tasks/task_1516531822/build/venv/bin/python', '/Users/cltbld/tasks/task_1516531822/build/tests/talos/talos/run_tests.py', '--suite', 'g2-e10s', '--executablePath', '/Users/cltbld/tasks/task_1516531822/build/application/Firefox Nightly.app/Contents/MacOS/firefox', '--symbolsPath', 'https://queue.taskcluster.net/v1/task/ZDIj12-QQFq98lgOCFxmAg/artifacts/public/build/target.crashreporter-symbols.zip', '--title', 't-yosemite-r7-0216', '--geckoProfile', '--webServer', 'localhost', '--webServer', 'localhost', '--webServer', 'localhost', '--webServer', 'localhost', '--log-tbpl-level=debug', '--log-errorsummary=/Users/cltbld/tasks/task_1516531822/build/blobber_upload_dir/g2-e10s_errorsummary.log', '--log-raw=/Users/cltbld/tasks/task_1516531822/build/blobber_upload_dir/g2-e10s_raw.log'] 03:57:26 ERROR - timed out after 3600 seconds of no output 03:57:26 ERROR - Return code: -15 03:57:26 WARNING - setting return code to -15 03:57:26 ERROR - # TBPL FAILURE # 03:57:26 WARNING - setting return code to 2 03:57:26 INFO - Running pre test command check_screen_resolution with 'bash -c screenresolution get && screenresolution list && system_profiler SPDisplaysDataType' 03:57:26 INFO - Running command: ('bash', '-c', 'screenresolution get && screenresolution list && system_profiler SPDisplaysDataType') in /Users/cltbld/tasks/task_1516531822/build 03:57:26 INFO - Copy/paste: bash -c "screenresolution get && screenresolution list && system_profiler SPDisplaysDataType" 03:57:26 INFO - 2018-01-21 03:57:26.029 screenresolution[1746:15073] starting screenresolution argv=screenresolution get 03:57:26 INFO - 2018-01-21 03:57:26.032 screenresolution[1746:15073] Display 0: 1600x1200x32@60 03:57:26 INFO - 2018-01-21 03:57:26.038 screenresolution[1747:15076] starting screenresolution argv=screenresolution list 03:57:26 INFO - Available Modes on Display 0 03:57:26 INFO - 1600x1200x32@60 1152x870x32@75 1280x1024x32@75 03:57:26 INFO - 1024x768x32@75 1024x768x32@70 1024x768x32@60 03:57:26 INFO - 800x600x32@75 800x600x32@60 640x480x32@75 03:57:26 INFO - 640x480x32@60 1280x1024x32@60 1280x960x32@60 03:57:26 INFO - 1280x960x32@75 1344x1008x32@60 Graphics/Displays: 03:57:26 INFO - Intel Iris: 03:57:26 INFO - Chipset Model: Intel Iris 03:57:26 INFO - Type: GPU 03:57:26 INFO - Bus: Built-In 03:57:26 INFO - VRAM (Dynamic, Max): 1536 MB 03:57:26 INFO - Vendor: Intel (0x8086) 03:57:26 INFO - Device ID: 0x0a2e 03:57:26 INFO - Revision ID: 0x0009 03:57:26 INFO - Displays: 03:57:26 INFO - Display: 03:57:26 INFO - Resolution: 1600 x 1200 @ 60 Hz 03:57:26 INFO - Pixel Depth: 32-Bit Color (ARGB8888) 03:57:26 INFO - Display Serial Number: 1600x1200 60 03:57:26 INFO - Main Display: Yes 03:57:26 INFO - Mirror: Off 03:57:26 INFO - Online: Yes 03:57:26 INFO - Rotation: Supported 03:57:26 INFO - Return code: 0 03:57:26 INFO - Running post-action listener: _package_coverage_data 03:57:26 INFO - Running post-action listener: _resource_record_post_action 03:57:26 INFO - [mozharness: 2018-01-21 11:57:26.312895Z] Finished run-tests step (success) 03:57:26 INFO - Running post-run listener: _resource_record_post_run 03:57:26 INFO - Total resource usage - Wall time: 3767s; CPU: 43.0%; Read bytes: 283304960; Write bytes: 406849024; Read time: 18360; Write time: 1677 03:57:26 INFO - TinderboxPrint: CPU usage<br/>42.6% 03:57:26 INFO - TinderboxPrint: I/O read bytes / time<br/>283,304,960 / 18,360 03:57:26 INFO - TinderboxPrint: I/O write bytes / time<br/>406,849,024 / 1,677 03:57:26 INFO - TinderboxPrint: CPU idle<br/>8,647.5 (57.4%) 03:57:26 INFO - TinderboxPrint: CPU system<br/>3,581.0 (23.8%) 03:57:26 INFO - TinderboxPrint: CPU user<br/>2,838.9 (18.8%) 03:57:26 INFO - TinderboxPrint: Swap in / out<br/>286,097,408 / 0 03:57:26 INFO - install - Wall time: 23s; CPU: 53.0%; Read bytes: 240678912; Write bytes: 233997824; Read time: 17873; Write time: 495 03:57:26 INFO - setup-mitmproxy - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0 03:57:26 INFO - run-tests - Wall time: 3744s; CPU: 43.0%; Read bytes: 38353920; Write bytes: 161767424; Read time: 240; Write time: 1154 03:57:28 INFO - Running post-run listener: _upload_blobber_files 03:57:28 WARNING - Blob upload gear skipped. Missing cmdline options. 03:57:28 INFO - Running post-run listener: copy_logs_to_upload_dir 03:57:28 INFO - Copying logs to upload dir... 03:57:28 INFO - mkdir: /Users/cltbld/tasks/task_1516531822/build/upload/logs 03:57:28 INFO - Copying logs to upload dir... 03:57:28 WARNING - returning nonzero exit status 2 [taskcluster 2018-01-21T11:57:28.512Z] Exit Code: 2 [taskcluster 2018-01-21T11:57:28.512Z] === Task Finished ===
Comment 1•6 years ago
|
||
rwood: we'll probably misstar the two like crazy, but this bug was an attempt to separate out the fact that your new baby, talos-with-profiling, mostly times out on g2. Perhaps that provides clues for bug 1427930, perhaps it's a completely separate thing, dunno.
Flags: needinfo?(rwood)
Summary: Intermittent nightly Automation Error: mozprocess timed out after 3600 seconds running ['/Users/cltbld/tasks/task_1516531822/build/venv/bin/python', '/Users/cltbld/tasks/task_1516531822/build/tests/talos/talos/run_tests.py', '--suite' → Intermittent talos timeout on startup, "command timed out : 3600 seconds without output , attempting to kill", " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling timeout on startup
Updated•6 years ago
|
Summary: Intermittent talos timeout on startup, "command timed out : 3600 seconds without output , attempting to kill", " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling timeout on startup → Intermittent talos timeout on startup, " timed out after 3600 seconds of no output ": g2 OS X nightly with profiling
Comment hidden (Intermittent Failures Robot) |
Comment 3•6 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #1) > rwood: we'll probably misstar the two like crazy, but this bug was an > attempt to separate out the fact that your new baby, talos-with-profiling, > mostly times out on g2. Perhaps that provides clues for bug 1427930, perhaps > it's a completely separate thing, dunno. Thanks Phil. Some kind of assertion error during the first cycle and then the test hangs for an hour and times out: 02:57:25 INFO - PID 1676 | Assertion failure 02:57:25 INFO - PID 1676 | assert@chrome://browser/content/tabbrowser.xml:4697:46 02:57:25 INFO - PID 1676 | finish@chrome://browser/content/tabbrowser.xml:4548:15 02:57:25 INFO - PID 1676 | postActions@chrome://browser/content/tabbrowser.xml:4852:17 02:57:25 INFO - PID 1676 | onUnloadTimeout@chrome://browser/content/tabbrowser.xml:4866:15 .... 03:57:26 ERROR - timed out after 3600 seconds of no output First I'll see if I can reproduce this locally.
Flags: needinfo?(rwood)
Comment 4•6 years ago
|
||
Can't reproduce it locally, however in the production logs it looks like a consistent problem with: 08:01:27 INFO - PID 1926 | Open toolbox on 'console.openwithcache' Pushing to try with that specific subtest disabled (but geckoProfile enabled).
Comment 5•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=42eab9d5e863b844cebe96ab7eb0e4c705f9d884
Comment 6•6 years ago
|
||
(In reply to Robert Wood [:rwood] from comment #5) > https://treeherder.mozilla.org/#/jobs?repo=try&revision=42eab9d5e863b844cebe96ab7eb0e4c705f9d884 All green - so it's the 'console.openwithcache' subtest that is the issue. :bgrins, as the dev contact for DAMP, can you please have a look? There's a problem with the 'console.openwithcache' subtest on OSX (at least). The test hangs with the assertion failure noted in comment 3. This failure is very consistent when the test is run with geckoProfile enabled. Thanks!
Flags: needinfo?(bgrinstead)
Comment 7•6 years ago
|
||
I haven't figured it out yet, but I did a try push with some extra logging and where I bumped the number of messages it was waiting for down to 10 (from 100) and it's still orange: https://treeherder.mozilla.org/#/jobs?repo=try&revision=47105c1fa55907afbf79bf997299a27e87317b41&selectedJob=157813911. Oddly the info() calls and any reference to openwithcache aren't in the raw log - I'm not sure if this is an issue with the logging being dropped or if the function really isn't getting called for some reason.
Comment 8•6 years ago
|
||
Thanks for looking into this Brian. The only way I can get any logging to show up in my local console from inside damp.js _consoleOpenWithCachedMessagesTest() is to use: await this.openToolboxAndLog("*RW* inside damp!");
Comment hidden (Intermittent Failures Robot) |
Comment 10•6 years ago
|
||
Alex, do you have an idea on how to debug this issue that mostly occurs only when profiling? See Comments 6-8
Flags: needinfo?(bgrinstead) → needinfo?(poirot.alex)
Comment hidden (Intermittent Failures Robot) |
Comment 12•6 years ago
|
||
I think I got similar failure when running it locally, there was some additional exception printed in the browser console: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIScriptSecurityManager.getLoadContextCodebasePrincipal] remote-browser.xml:406 I'm wondering if that's just OOM because of profiler buffer. Here it really looks like a OOM and it is always hard to debug as it just crashes randomly and then result in exception at arbitrary locations.
Flags: needinfo?(poirot.alex)
Comment 13•6 years ago
|
||
It may be helpful to see if the exception disappear with a smaller buffer (--geckoProfileEntries)
Comment hidden (mozreview-request) |
Comment 15•6 years ago
|
||
I reduced the occurence of the local exception with this patch. Is there a particular try syntax to reproduce these g2 profiling jobs?
Updated•6 years ago
|
Flags: needinfo?(rwood)
Comment 16•6 years ago
|
||
I reused bgrins try syntax from comment 7 and have been able to have 5 green runs: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab88126c9ea648be84fd2c93ad50a8d5f3bf8a1f&selectedJob=161068999 I requested some more runs, but it looks good.
Flags: needinfo?(rwood)
Comment 17•6 years ago
|
||
Comment on attachment 8948674 [details] Bug 1432037 - Wait for tab animation end when closing test tabs. This patch is basically reusing the same technique used in tart over here: https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/tart/addon/content/tart.js#45 It relates to BrowserCloseTabOrWindow call we both do in damp and tart. TBH, I'm not sure BrowserCloseTabOrWindow is the best API to use to close a tab, there may be a better API that is not racy like this one.
Attachment #8948674 -
Flags: review?(bgrinstead)
Comment hidden (Intermittent Failures Robot) |
Comment 19•6 years ago
|
||
mozreview-review |
Comment on attachment 8948674 [details] Bug 1432037 - Wait for tab animation end when closing test tabs. https://reviewboard.mozilla.org/r/218078/#review225280 Nice find! Works for me if it fixes the intermittent
Attachment #8948674 -
Flags: review?(bgrinstead) → review+
Comment 20•6 years ago
|
||
Pushed by apoirot@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/62b6738695de Wait for tab animation end when closing test tabs. r=bgrins
Comment 21•6 years ago
|
||
Backed out for talos damp crashes at google_breakpad::ExceptionHandler::WriteMinidump() Push that started the failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=62b6738695defa58dbee72681146af2f0cc65c5c Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=161749537&repo=autoland&lineNumber=3594 Backout: https://hg.mozilla.org/integration/autoland/rev/18f32284c48e96e7dd98c900615a127b586fb612
Flags: needinfo?(poirot.alex)
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(poirot.alex)
Resolution: --- → WORKSFORME
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•