Closed Bug 1440995 Opened 6 years ago Closed 6 years ago

Intermittent talos OS X Aborting task - max run time exceeded! followed by Exit Code: -1

Categories

(Testing :: Talos, defect)

Version 3
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1445580

People

(Reporter: apavel, Unassigned)

References

Details

(Whiteboard: [stockwell infra])

Revision: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bfe62272d2a21f9d10d45e5aaa680f5c735604ae

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=164190404&repo=mozilla-central


02:59:47     INFO -  TEST-INFO | started process 2511 (/Users/cltbld/tasks/task_1519524388/build/application/Firefox Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/9g/1jk_l9qn7pn68yzgn2443qzc00000w/T/tmpQmZUGh/profile http://localhost:49225/startup_test/tresize/addon/content/tresize-test.html?gecko_profile_entries=1000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=%2Fvar%2Ffolders%2F9g%2F1jk_l9qn7pn68yzgn2443qzc00000w%2FT%2Ftmp1WLxYm&gecko_profile_interval=2 http://localhost:49225/startup_test/tresize/addon/content/tresize-test.html?gecko_profile_entries=1000000&gecko_profile_threads=GeckoMain%2CCompositor&gecko_profile_dir=%2Fvar%2Ffolders%2F9g%2F1jk_l9qn7pn68yzgn2443qzc00000w%2FT%2Ftmp1WLxYm&gecko_profile_interval=2)
02:59:47     INFO -  PID 2511 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:59:48     INFO -  PID 2511 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:59:49     INFO -  PID 2511 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:59:49     INFO -  PID 2511 | 2018-02-25 02:59:49.392 plugin-container[2512:33809] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9d3b, name = 'com.apple.tsm.portname'
02:59:49     INFO -  PID 2511 | See /usr/include/servers/bootstrap_defs.h for the error codes.
02:59:49     INFO -  PID 2511 | 2018-02-25 02:59:49.421 plugin-container[2513:33831] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x924b, name = 'com.apple.tsm.portname'
02:59:49     INFO -  PID 2511 | See /usr/include/servers/bootstrap_defs.h for the error codes.
02:59:50     INFO -  PID 2511 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
02:59:50     INFO -  PID 2511 | 2018-02-25 02:59:50.758 plugin-container[2514:33941] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x963f, name = 'com.apple.tsm.portname'
02:59:50     INFO -  PID 2511 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:00:03     INFO -  PID 2511 | 2018-02-25 03:00:02.999 firefox[2511:33804] Persistent UI failed to open file file:///Users/cltbld/Library/Saved%20Application%20State/org.mozilla.nightly.savedState/window_1.data: No such file or directory (2)
03:00:05     INFO -  PID 2511 | __start_report28.62__end_report
03:00:05     INFO -  PID 2511 | __startTimestamp1519556405340__endTimestamp
03:00:06     INFO -  PID 2511 | *** UTM:SVC TimerManager:registerTimer called after profile-before-change notification. Ignoring timer registration for id: telemetry_modules_ping
03:00:06     INFO -  TEST-INFO | 2511: exit 0
[taskcluster 2018-02-25T11:00:17.346Z] Aborting task - max run time exceeded!
[taskcluster 2018-02-25T11:00:17.350Z] Exit Code: -1
[taskcluster 2018-02-25T11:00:17.350Z] === Task Finished ===
[taskcluster 2018-02-25T11:00:17.350Z] Task Duration: 14m58.809909888s
[taskcluster 2018-02-25T11:00:17.980Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:18.439Z] Uploading artifact public/logs/talos_critical.log from file logs/talos_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:18.797Z] Uploading artifact public/logs/talos_error.log from file logs/talos_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:19.176Z] Uploading artifact public/logs/talos_fatal.log from file logs/talos_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:19.569Z] Uploading artifact public/logs/talos_info.log from file logs/talos_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:19.993Z] Uploading artifact public/logs/talos_raw.log from file logs/talos_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:20.539Z] Uploading artifact public/logs/talos_warning.log from file logs/talos_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:20.908Z] Uploading artifact public/test_info/chromez-e10s_errorsummary.log from file build/blobber_upload_dir/chromez-e10s_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:21.295Z] Uploading artifact public/test_info/chromez-e10s_raw.log from file build/blobber_upload_dir/chromez-e10s_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:21.697Z] Uploading artifact public/test_info/missingsymbols.zip from file build/blobber_upload_dir/missingsymbols.zip with content encoding "", mime type "application/zip" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:22.578Z] Uploading artifact public/test_info/profile_about_preferences_basic.zip from file build/blobber_upload_dir/profile_about_preferences_basic.zip with content encoding "", mime type "application/zip" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:23.241Z] Uploading artifact public/test_info/profile_tresize.zip from file build/blobber_upload_dir/profile_tresize.zip with content encoding "", mime type "application/zip" and expiry 2019-02-25T10:03:27.189Z
[taskcluster 2018-02-25T11:00:23.574Z] Task not successful due to following exception(s):
[taskcluster 2018-02-25T11:00:23.574Z] Exception 1)
[taskcluster 2018-02-25T11:00:23.574Z] signal: killed
Over the last 7 days this bug has 30 failures. These happen on OS X 10.10 and macosx64-nightly 

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165946894
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
these are all on profiling jobs- I suspect there is something broken there- possibly a quick fix.
Since 11.03 this fail spikes up and now there are 75 failures in the last 3 days.
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1440995&startday=2018-03-11&endday=2018-03-13&tree=all

Most of them are on OS X 10.10 and only 6 on macosx64-nightly. There are 2 on Linux but those are starred wrong
:rwood, can you please take a look?
this seems to be mostly related to the time to download artifact [1]:
3:14:41     INFO -  u'xpcshell': [u'target.common.tests.zip', u'target.xpcshell.tests.zip']}
23:14:41     INFO - Downloading packages: [u'target.common.tests.zip', u'target.mochitest.tests.zip'] for test suite categories: ['mochitest']
23:14:41     INFO - Downloading and extracting to /Users/cltbld/tasks/task_1520920450/build/tests these dirs * from https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip
23:14:41     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip'}, attempt #1
23:14:41     INFO - Fetch https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip into memory
23:15:01     INFO - Content-Length response header: 36551587
23:15:01     INFO - Bytes received: 36551587
23:15:18     INFO - Downloading and extracting to /Users/cltbld/tasks/task_1520920450/build/tests these dirs * from https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip
23:15:18     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip'}, attempt #1
23:15:18     INFO - Fetch https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip into memory
23:52:34     INFO - Content-Length response header: 78551031
23:52:34     INFO - Bytes received: 78551031
23:52:46     INFO - Downloading https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.dmg to /Users/cltbld/tasks/task_1520920450/installer.dmg
23:52:46     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.dmg', 'file_name': '/Users/cltbld/tasks/task_1520920450/installer.dmg'}, attempt #1
23:53:00     INFO - Downloaded 73040179 bytes.
23:53:00     INFO - Setting buildbot property build_url to https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.dmg
23:53:00     INFO - mkdir: /Users/cltbld/tasks/task_1520920450/properties
23:53:00     INFO - Writing buildbot properties ['build_url'] to /Users/cltbld/tasks/task_1520920450/properties/build_url
23:53:00     INFO - Writing to file /Users/cltbld/tasks/task_1520920450/properties/build_url


that is almost 30 minutes to download target.common.tests.zip which is 78MB.

In other cases [2], [3] we time out before finishing the download of a .zip or python package.  Here is a snippet from [4]:
15:04:59     INFO -  Collecting simplejson
15:06:15     INFO -    Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x1098fbd10>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
15:07:30     INFO -    Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x1098fb1d0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
15:08:46     INFO -    Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x109a1d090>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub


I am tempted to make this a duplicate of bug 1443130, needinfo :dragrom as that is the next step on bug 1443130 as well.

[1] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167596727
[2] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167596731
[3] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167793330
[4] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167775711
Flags: needinfo?(rwood) → needinfo?(dcrisan)
Tested on t-yosemite-r7-423.test.releng.mdc1.mozilla.com:

[root@t-yosemite-r7-423.test.releng.mdc1.mozilla.com ~]# wget https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip; rm target.common.tests.zip
--2018-03-14 05:24:24--  https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip
Resolving queue.taskcluster.net... 50.16.228.78, 107.22.197.53, 50.17.218.87
Connecting to queue.taskcluster.net|50.16.228.78|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.common.tests.zip [following]
--2018-03-14 05:24:25--  https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.common.tests.zip
Resolving taskcluster-artifacts.net... 54.192.212.56
Connecting to taskcluster-artifacts.net|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36551587 (35M) [application/zip]
Saving to: 'target.common.tests.zip'

100%[==================================================================================================================================================================>] 36,551,587  6.30MB/s   in 11s    

2018-03-14 05:24:37 (3.26 MB/s) - 'target.common.tests.zip' saved [36551587/36551587]

[root@t-yosemite-r7-423.test.releng.mdc1.mozilla.com ~]# wget https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip; rm target.common.tests.zip
--2018-03-14 05:24:43--  https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.common.tests.zip
Resolving queue.taskcluster.net... 50.16.228.78, 107.22.197.53, 50.17.218.87
Connecting to queue.taskcluster.net|50.16.228.78|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.common.tests.zip [following]
--2018-03-14 05:24:44--  https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.common.tests.zip
Resolving taskcluster-artifacts.net... 54.192.212.56
Connecting to taskcluster-artifacts.net|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36551587 (35M) [application/zip]
Saving to: 'target.common.tests.zip'

100%[==================================================================================================================================================================>] 36,551,587  5.32MB/s   in 7.7s   

2018-03-14 05:24:52 (4.53 MB/s) - 'target.common.tests.zip' saved [36551587/36551587]

[root@t-yosemite-r7-423.test.releng.mdc1.mozilla.com ~]# wget https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip; rm target.mochitest.tests.zip
--2018-03-14 05:26:14--  https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip
Resolving queue.taskcluster.net... 50.16.228.78, 107.22.197.53, 50.17.218.87
Connecting to queue.taskcluster.net|50.16.228.78|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.mochitest.tests.zip [following]
--2018-03-14 05:26:15--  https://taskcluster-artifacts.net/aLf3dLO8T2eepBczSorNug/0/public/build/target.mochitest.tests.zip
Resolving taskcluster-artifacts.net... 54.192.212.56
Connecting to taskcluster-artifacts.net|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78551031 (75M) [application/zip]
Saving to: 'target.mochitest.tests.zip'

100%[==================================================================================================================================================================>] 78,551,031  5.71MB/s   in 18s    

2018-03-14 05:26:35 (4.16 MB/s) - 'target.mochitest.tests.zip' saved [78551031/78551031]

I also used curl instead of wget:
[root@t-yosemite-r7-423.test.releng.mdc1.mozilla.com ~]# curl -L -O  https://queue.taskcluster.net/v1/task/aLf3dLO8T2eepBczSorNug/artifacts/public/build/target.mochitest.tests.zip; rm target.mochitest.tests.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   124  100   124    0     0    104      0  0:00:01  0:00:01 --:--:--   104
100 74.9M  100 74.9M    0     0  4813k      0  0:00:15  0:00:15 --:--:-- 6135k

looks like the average speed is around 4,5 MB/s

If I run wget or curl on individual hosts or on the max 3 hosts in the same time, the download speed looks OK.(I tested 3 downloads in the same time from 3 different hosts). Will be interesting to see the network load if more workers will try to download the artifacts in the same time..... 

:dividehex Who can give us a network load for the last 24 hours? I suspect a network bottleneck, when many worker try to download the artifacts in the same time....
Flags: needinfo?(dcrisan) → needinfo?(jwatkins)
Blocks: 1446784
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jwatkins)
Resolution: --- → DUPLICATE
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
You need to log in before you can comment on or make changes to this bug.