Closed
Bug 1300413
Opened 9 years ago
Closed 7 years ago
Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Release Engineering
Applications: MozharnessCore
Tracking
(e10s+, firefox50 affected, firefox51 affected, firefox52 wontfix)
People
(Reporter: aryx, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Blocks: e10s-tests
tracking-e10s:
--- → +
| Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Summary: Intermittent command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest', '--e10s', '--blob-upload-branch', 'fx-team', ' → Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Comment 12•9 years ago
|
||
We have a different signature now.
From the output [2] it is hard to tell at what point something could be getting messed up.
The last output comes from here [1].
Maybe the issue comes with:
> response.read()
We don't see any output from retrying a second time.
Any ideas on how to determine why we would stop having output?
I wonder if we should call python with -u
My second patch seems to have reduced how severe is:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300413&startday=2016-09-01&endday=2016-09-16&tree=all
[1]
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#395
[2]
> 03:35:26 INFO - Downloading and extracting to /builds/slave/test/build/symbols these dirs * from https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip
> 03:35:26 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip'}, attempt #1
> 03:35:26 INFO - Fetch https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip into memory
> 03:35:27 INFO - Expected file size: 135553597
>
> command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/mac_unittest.py', '--reftest-suite', 'reftest', '--blob-upload-branch', 'autoland', '--download-symbols', 'true'], attempting to kill
> process killed by signal 9
> program finished with exit code -1
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Assignee: nobody → armenzg
Updated•9 years ago
|
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (mozreview-request) |
Comment 30•9 years ago
|
||
gps, dustin: it seems that Mac/Linux machines intermittently time out when downloading files (well, Buildbot kills the jobs because there is no output). Do you have any suggestions as to what to try?
These Mac machines have 16GB of RAM.
I'm running the download and extract action a hundred times and tracking the free memory.
I have not managed to drop below 12GB of free ram.
The pattern seems to show spikes over the weekends (except one weekend):
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300413&startday=2016-09-01&endday=2016-10-03&tree=all
There's something new I noticed. It does not only affect to the new download to memory method since it has also happened on mozilla-beta:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-beta&job_id=1703993#L513
Anyways, it might simply be bug bug 1306421 (Connecting to proxxy fails when running in TaskCluster)
This Linux failure is also interesting:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=3303263#L1660
The issues appear on pip:
> ChunkedEncodingError: IncompleteRead(1763 bytes read, 6242 more expected)
[1]
16:55:13 INFO - proxxy config: {'regions': ['.use1.', '.usw2.', '.scl3'], 'instances': ['proxxy1.srv.releng.use1.mozilla.com', 'proxxy1.srv.releng.usw2.mozilla.com', 'proxxy1.srv.releng.scl3.mozilla.com'], 'urls': [('http://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp-ssl.mozilla.org', 'ftp.mozilla.org'), ('http://pvtbuilds.pvt.build.mozilla.org', 'pvtbuilds.mozilla.org'), ('http://pypi.pvt.build.mozilla.org', 'pypi.pvt.build.mozilla.org'), ('http://pypi.pub.build.mozilla.org', 'pypi.pub.build.mozilla.org'), ('https://queue.taskcluster.net', 'queue.taskcluster.net')]}
16:55:13 INFO - https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip matches https://queue.taskcluster.net
16:55:13 INFO - URL Candidate: http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:55:13 INFO - trying http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:55:13 INFO - Downloading http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:55:13 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #1
16:55:14 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:55:14 INFO - retry: attempt #1 caught exception: HTTP Error 404: Not Found
16:55:14 INFO - retry: Failed, sleeping 30 seconds before retrying
16:55:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #2
16:55:44 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:55:44 INFO - retry: attempt #2 caught exception: HTTP Error 404: Not Found
16:55:44 INFO - retry: Failed, sleeping 60 seconds before retrying
16:56:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #3
16:56:44 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:56:44 INFO - retry: attempt #3 caught exception: HTTP Error 404: Not Found
16:56:44 INFO - Can't download from http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip!
16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
16:56:44 INFO - trying https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:56:44 INFO - Downloading https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
16:56:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #1
command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/linux_unittest.py', '--mochitest-suite', 'mochitest-devtools-chrome', '--e10s', '--total-chunks', '8', '--this-chunk', '6', '--blob-upload-branch', 'mozilla-beta', '--download-symbols', 'ondemand'], attempting to kill
process killed by signal 9
program finished with exit code -1
| Comment hidden (mozreview-request) |
Comment 32•9 years ago
|
||
In MozReview I'm asking to run Linux and Mac with python -u.
Comment 33•9 years ago
|
||
I don't have any great ideas.
For a file that size, I can't see RAM usage being the problem.
Weekend spikes might just be philor being more diligent about starring than other sheriffs.
Proxxy appears unrelated here, since the error only occurs after 404's from proxxy have caused mozharness to move on to the queue itself.
It seems you're casting a wide net for similar events, which may not share the same cause. In particular, that pip issue seems unrelated.
Does the download function log output during the download? If not, unbuffering might not help..
Comment 34•9 years ago
|
||
| mozreview-review | ||
Comment on attachment 8797617 [details]
Bug 1300413 - Use unbuffered python in case we're missing output.
https://reviewboard.mozilla.org/r/83284/#review81788
All Python processes should be run in unbuffered mode in automation so we don't have problems with process output getting swallowed in a buffer. FWIW, lots of mozharness timings are wrong today because of output buffering of invoked subprocesses.
A more robust way to run Python processes in unbuffered mode is to set `PYTHONUNBUFFERED=1`. But please don't do that without measuring literally everything because making processes fully unbuffered could add a lot of overhead to shuffling data between processes and this may slow down automation.
Attachment #8797617 -
Flags: review+
Comment 35•9 years ago
|
||
| mozreview-review | ||
Comment on attachment 8797617 [details]
Bug 1300413 - Use unbuffered python in case we're missing output.
https://reviewboard.mozilla.org/r/83284/#review81796
Attachment #8797617 -
Flags: review?(bugspam.Callek) → review+
Comment 36•9 years ago
|
||
https://hg.mozilla.org/build/buildbot-configs/rev/1d22c76665b87944de3ab0108a63ff56bee9a3c3
Bug 1300413 - Use unbuffered python in case we're missing output. r=Callek
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Comment 45•9 years ago
|
||
Updated•9 years ago
|
Summary: Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest' → Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'
Comment 46•9 years ago
|
||
And my scrambling around to put that -u into various bug summaries so they'll still match says that doing so hasn't stopped this, or the talos startup hang of bug 1298766 and bug 1192317.
| Comment hidden (Intermittent Failures Robot) |
Comment 48•9 years ago
|
||
-u does not necessarily improve seeing more in the logs [1].
My current thinking is that the server drops the ball when it is supposed to send us data and we get blocked with response.read() [2].
I wonder if we need to read the data differently or use the io module.
What we know is that the output stops before the info line [3]:
> file_contents = response.read()
> obtained_file_size = len(file_contents)
> self.info('Expected file size: {}'.format(expected_file_size))
What puzzles me more is why this is mainly affecting Mac machines. Could it be that they're at a datacenter? The Python version is 2.7.3
[1]
05:35:41 INFO - Downloading packages: [u'firefox-52.0a1.en-US.mac.common.tests.zip', u'firefox-52.0a1.en-US.mac.mochitest.tests.zip'] for test suite category: mochitest
05:35:41 INFO - Downloading and extracting to /builds/slave/test/build/tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, mochitest/* from https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip
05:35:41 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip'}, attempt #1
05:35:41 INFO - Fetch https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip into memory
05:35:42 INFO - Http code: 200
05:35:42 INFO - via: 1.1 8008015354a3ca72f56c382a1d1cfe9f.cloudfront.net (CloudFront)
05:35:42 INFO - x-amz-cf-id: c7eNQKeUz3hCA1V7h_dFoRvoWAEiujYQ3-2Nk6VHjJnOpZeviLKzsA==
05:35:42 INFO - x-amz-version-id: 7UpkW70NVytWUuOXJjucmWx7M7TbW5NJ
05:35:42 INFO - Expected file size: 16508486
05:35:42 INFO - Obtained file size: 16508486
05:35:43 INFO - Downloading and extracting to /builds/slave/test/build/tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, mochitest/* from https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip
05:35:43 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip'}, attempt #1
05:35:43 INFO - Fetch https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip into memory
05:35:44 INFO - Http code: 200
05:35:44 INFO - via: 1.1 9552af4ceb2e0dd55964a80d1e549fd5.cloudfront.net (CloudFront)
05:35:44 INFO - x-amz-cf-id: 41QGIOHqdrk2JgKKOIfGn6MXlI7YDD51Hv2ejXBgzqBxYP_ha1v1qA==
05:35:44 INFO - x-amz-version-id: KwiCTmENhvXVmRkXD1Jji2O8cWlAv.Wc
command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/mac_unittest.py', '--mochitest-suite', 'plain-chunked', '--e10s', '--total-chunks', '5', '--this-chunk', '3', '--blob-upload-branch', 'autoland', '--download-symbols', 'ondemand'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1805.947631
[2] https://docs.python.org/2/library/stdtypes.html#file.read
> file.read([size])¶
> Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted,
> read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately.
> (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function
> fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was
> requested may be returned, even if no size parameter was given.
>
> Note: This function is simply a wrapper for the underlying fread() C function, and will behave the same in corner cases, such as whether the EOF value is cached.
[3] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py?q=x-amz&redirect_type=single#400
Comment 49•9 years ago
|
||
I suspect you're right -- I didn't realize this is just using urlopen, which doesn't do any kind of request timeout or anything like that. Perhaps upgrading to requests?
Comment 50•9 years ago
|
||
requests would be a good option. I would have to ask how to make a drop in mozilla-central and make Mozharness use of it.
I believe urlopen() does use timeout, however, the read() call is blocking.
I read this in the docs for urlopen:
https://docs.python.org/2/library/urllib2.html#urllib2.urlopen
> The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt
> (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and
> FTP connections.
| Comment hidden (Intermittent Failures Robot) |
Comment 52•9 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #50)
> requests would be a good option. I would have to ask how to make a drop in
> mozilla-central and make Mozharness use of it.
>
> I believe urlopen() does use timeout, however, the read() call is blocking.
>
> I read this in the docs for urlopen:
> https://docs.python.org/2/library/urllib2.html#urllib2.urlopen
> > The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt
> > (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and
> > FTP connections.
It doesn't look like this call to urllib2.urlopen is specifying a timeout. Can we try that first?
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py?q=x-amz&redirect_type=single#390
Comment 53•9 years ago
|
||
Requests is already in-tree (`python/requests`). Requests is a much more robust HTTP client. But just adding the timeout option might help this particular issue.
| Comment hidden (Intermittent Failures Robot) |
Comment 55•9 years ago
|
||
I also see requests failing in a similar manner when posting:
https://github.com/mozilla/build-blobuploader/blob/fc9923f7e817bb5c7dfa93509984c777110c492d/blobberc.py#L214
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=37736481#L6781
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Assignee: armenzg → nobody
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
| Comment hidden (Intermittent Failures Robot) |
Comment 69•8 years ago
|
||
Mass wontfix for bugs affecting firefox 52.
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•