Closed Bug 1300413 Opened 9 years ago Closed 7 years ago

Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(e10s+, firefox50 affected, firefox51 affected, firefox52 wontfix)

RESOLVED WONTFIX
Tracking Status
e10s + ---
firefox50 --- affected
firefox51 --- affected
firefox52 --- wontfix

People

(Reporter: aryx, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Blocks: e10s-tests
tracking-e10s: --- → +
Component: General Automation → Mozharness
Depends on: 1300812
QA Contact: catlee → jlund
Summary: Intermittent command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest', '--e10s', '--blob-upload-branch', 'fx-team', ' → Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'
We have a different signature now. From the output [2] it is hard to tell at what point something could be getting messed up. The last output comes from here [1]. Maybe the issue comes from:

> response.read()

We don't see any output from retrying a second time. Any ideas on how to determine why we would stop having output? I wonder if we should call python with -u.

My second patch seems to have reduced how severe it is: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300413&startday=2016-09-01&endday=2016-09-16&tree=all

[1] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#395

[2]
> 03:35:26 INFO - Downloading and extracting to /builds/slave/test/build/symbols these dirs * from https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip
> 03:35:26 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip'}, attempt #1
> 03:35:26 INFO - Fetch https://queue.taskcluster.net/v1/task/A3o2k8JHSpGW37nJuqdHAQ/artifacts/public/build/firefox-51.0a1.en-US.mac64.crashreporter-symbols.zip into memory
> 03:35:27 INFO - Expected file size: 135553597
>
> command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/mac_unittest.py', '--reftest-suite', 'reftest', '--blob-upload-branch', 'autoland', '--download-symbols', 'true'], attempting to kill
> process killed by signal 9
> program finished with exit code -1
Depends on: 1303759
Assignee: nobody → armenzg
Depends on: 1307103
gps, dustin: it seems that Mac/Linux machines intermittently time out when downloading files (well, Buildbot kills the jobs because there is no output). Do you have any suggestions as to what to try?

These Mac machines have 16GB of RAM. I'm running the download-and-extract action a hundred times and tracking the free memory. I have not managed to drop below 12GB of free RAM.

The pattern seems to show spikes over the weekends (except one weekend): https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1300413&startday=2016-09-01&endday=2016-10-03&tree=all

There's something new I noticed: it does not only affect the new download-to-memory method, since it has also happened on mozilla-beta: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-beta&job_id=1703993#L513

Anyway, it might simply be bug 1306421 (Connecting to proxxy fails when running in TaskCluster).

This Linux failure is also interesting: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=3303263#L1660

The issue appears in pip:

> ChunkedEncodingError: IncompleteRead(1763 bytes read, 6242 more expected)

[1]
> 16:55:13 INFO - proxxy config: {'regions': ['.use1.', '.usw2.', '.scl3'], 'instances': ['proxxy1.srv.releng.use1.mozilla.com', 'proxxy1.srv.releng.usw2.mozilla.com', 'proxxy1.srv.releng.scl3.mozilla.com'], 'urls': [('http://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp-ssl.mozilla.org', 'ftp.mozilla.org'), ('http://pvtbuilds.pvt.build.mozilla.org', 'pvtbuilds.mozilla.org'), ('http://pypi.pvt.build.mozilla.org', 'pypi.pvt.build.mozilla.org'), ('http://pypi.pub.build.mozilla.org', 'pypi.pub.build.mozilla.org'), ('https://queue.taskcluster.net', 'queue.taskcluster.net')]}
> 16:55:13 INFO - https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip matches https://queue.taskcluster.net
> 16:55:13 INFO - URL Candidate: http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:55:13 INFO - trying http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:55:13 INFO - Downloading http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:55:13 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #1
> 16:55:14 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:55:14 INFO - retry: attempt #1 caught exception: HTTP Error 404: Not Found
> 16:55:14 INFO - retry: Failed, sleeping 30 seconds before retrying
> 16:55:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #2
> 16:55:44 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:55:44 INFO - retry: attempt #2 caught exception: HTTP Error 404: Not Found
> 16:55:44 INFO - retry: Failed, sleeping 60 seconds before retrying
> 16:56:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #3
> 16:56:44 WARNING - Server returned status 404 HTTP Error 404: Not Found for http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:56:44 INFO - retry: attempt #3 caught exception: HTTP Error 404: Not Found
> 16:56:44 INFO - Can't download from http://queue.taskcluster.net.proxxy1.srv.releng.use1.mozilla.com/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip!
> 16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
> 16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
> 16:56:44 INFO - Caught exception: HTTP Error 404: Not Found
> 16:56:44 INFO - trying https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:56:44 INFO - Downloading https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip to /builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip
> 16:56:44 INFO - retry: Calling _download_file with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/SsD-cbrbTsmlSb7FRj9-Hg/artifacts/public/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip', 'file_name': u'/builds/slave/test/build/firefox-50.0.en-US.linux-x86_64-asan.mochitest.tests.zip'}, attempt #1
>
> command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/linux_unittest.py', '--mochitest-suite', 'mochitest-devtools-chrome', '--e10s', '--total-chunks', '8', '--this-chunk', '6', '--blob-upload-branch', 'mozilla-beta', '--download-symbols', 'ondemand'], attempting to kill
> process killed by signal 9
> program finished with exit code -1
In MozReview I'm asking to run Linux and Mac with python -u.
I don't have any great ideas. For a file that size, I can't see RAM usage being the problem. The weekend spikes might just be philor being more diligent about starring than other sheriffs. Proxxy appears unrelated here, since the error only occurs after 404s from proxxy have caused mozharness to move on to the queue itself. It seems you're casting a wide net for similar events, which may not share the same cause; in particular, the pip issue seems unrelated. Does the download function log output during the download? If not, unbuffering might not help.
Comment on attachment 8797617 [details]
Bug 1300413 - Use unbuffered python in case we're missing output.

https://reviewboard.mozilla.org/r/83284/#review81788

All Python processes should be run in unbuffered mode in automation so we don't have problems with process output getting swallowed in a buffer. FWIW, lots of mozharness timings are wrong today because of output buffering of invoked subprocesses.

A more robust way to run Python processes in unbuffered mode is to set `PYTHONUNBUFFERED=1`. But please don't do that without measuring literally everything, because making processes fully unbuffered could add a lot of overhead to shuffling data between processes, and this may slow down automation.
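A sketch of the two options the review contrasts: passing -u to one child versus setting `PYTHONUNBUFFERED=1` in its environment (which Python grandchildren inherit too). The helper name is made up for illustration and is not mozharness API.

```python
import os
import subprocess
import sys


def run_unbuffered(args):
    # PYTHONUNBUFFERED=1 in the child's environment disables stdout and
    # stderr buffering for that process and any Python descendants --
    # broader than -u, which only affects the immediate child. The review
    # warns this can add measurable overhead shuffling data between
    # processes, so it should be measured before being rolled out.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    return subprocess.run([sys.executable] + list(args),
                          env=env, capture_output=True, text=True)


result = run_unbuffered(["-c", "print('output arrives immediately')"])
```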
Attachment #8797617 - Flags: review+
Comment on attachment 8797617 [details]
Bug 1300413 - Use unbuffered python in case we're missing output.

https://reviewboard.mozilla.org/r/83284/#review81796
Attachment #8797617 - Flags: review?(bugspam.Callek) → review+
Summary: Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest' → Intermittent-infra command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/web_platform_tests.py', '--cfg', 'web_platform_tests/prod_config.py', '--test-type=reftest'
And my scrambling around to put that -u into various bug summaries so they'll still match says that doing so hasn't stopped this, or the talos startup hang of bug 1298766 and bug 1192317.
-u does not necessarily improve seeing more in the logs [1]. My current thinking is that the server drops the ball when it is supposed to send us data and we get blocked in response.read() [2]. I wonder if we need to read the data differently or use the io module.

What we know is that the output stops before the info line [3]:

> file_contents = response.read()
> obtained_file_size = len(file_contents)
> self.info('Expected file size: {}'.format(expected_file_size))

What puzzles me more is why this is mainly affecting Mac machines. Could it be that they're in a datacenter? The Python version is 2.7.3.

[1]
> 05:35:41 INFO - Downloading packages: [u'firefox-52.0a1.en-US.mac.common.tests.zip', u'firefox-52.0a1.en-US.mac.mochitest.tests.zip'] for test suite category: mochitest
> 05:35:41 INFO - Downloading and extracting to /builds/slave/test/build/tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, mochitest/* from https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip
> 05:35:41 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip'}, attempt #1
> 05:35:41 INFO - Fetch https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.common.tests.zip into memory
> 05:35:42 INFO - Http code: 200
> 05:35:42 INFO - via: 1.1 8008015354a3ca72f56c382a1d1cfe9f.cloudfront.net (CloudFront)
> 05:35:42 INFO - x-amz-cf-id: c7eNQKeUz3hCA1V7h_dFoRvoWAEiujYQ3-2Nk6VHjJnOpZeviLKzsA==
> 05:35:42 INFO - x-amz-version-id: 7UpkW70NVytWUuOXJjucmWx7M7TbW5NJ
> 05:35:42 INFO - Expected file size: 16508486
> 05:35:42 INFO - Obtained file size: 16508486
> 05:35:43 INFO - Downloading and extracting to /builds/slave/test/build/tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, mochitest/* from https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip
> 05:35:43 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip'}, attempt #1
> 05:35:43 INFO - Fetch https://queue.taskcluster.net/v1/task/cvoU90aiSdeTUZ62pfWu5Q/artifacts/public/build/firefox-52.0a1.en-US.mac.mochitest.tests.zip into memory
> 05:35:44 INFO - Http code: 200
> 05:35:44 INFO - via: 1.1 9552af4ceb2e0dd55964a80d1e549fd5.cloudfront.net (CloudFront)
> 05:35:44 INFO - x-amz-cf-id: 41QGIOHqdrk2JgKKOIfGn6MXlI7YDD51Hv2ejXBgzqBxYP_ha1v1qA==
> 05:35:44 INFO - x-amz-version-id: KwiCTmENhvXVmRkXD1Jji2O8cWlAv.Wc
>
> command timed out: 1800 seconds without output running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/desktop_unittest.py', '--cfg', 'unittests/mac_unittest.py', '--mochitest-suite', 'plain-chunked', '--e10s', '--total-chunks', '5', '--this-chunk', '3', '--blob-upload-branch', 'autoland', '--download-symbols', 'ondemand'], attempting to kill
> process killed by signal 9
> program finished with exit code -1 elapsedTime=1805.947631

[2] https://docs.python.org/2/library/stdtypes.html#file.read
> file.read([size])
> Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
>
> Note: This function is simply a wrapper for the underlying fread() C function, and will behave the same in corner cases, such as whether the EOF value is cached.

[3] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py?q=x-amz&redirect_type=single#400
I suspect you're right -- I didn't realize this is just using urlopen, which doesn't do any kind of request timeout or anything like that. Perhaps upgrading to requests?
requests would be a good option. I would have to ask how to drop it into mozilla-central and make mozharness use it.

I believe urlopen() does take a timeout; however, the read() call is blocking.

I read this in the docs for urlopen: https://docs.python.org/2/library/urllib2.html#urllib2.urlopen

> The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #50) > requests would be a good option. I would have to ask how to make a drop in > mozilla-central and make Mozharness use of it. > > I believe urlopen() does use timeout, however, the read() call is blocking. > > I read this in the docs for urlopen: > https://docs.python.org/2/library/urllib2.html#urllib2.urlopen > > The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt > > (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and > > FTP connections. It doesn't look like this call to urllib2.urlopen is specifying a timeout. Can we try that first? https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py?q=x-amz&redirect_type=single#390
Requests is already in-tree (`python/requests`). Requests is a much more robust HTTP client. But just adding the timeout option might help this particular issue.
Depends on: 1309912
Assignee: armenzg → nobody
Mass wontfix for bugs affecting firefox 52.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX