Android 7.0 x86 IOError: [Errno 28] No space left on device:
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
People
(Reporter: CosminS, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: docker-worker)
Attachments
(1 file, 1 obsolete file)
Failure logs: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258936798&repo=autoland&lineNumber=669
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258931944&repo=autoland&lineNumber=338
[vcs 2019-07-30T03:35:20.141Z] files [======> ] 80118/525924 10m33s
[vcs 2019-07-30T03:35:20.141Z]
[vcs 2019-07-30T03:35:20.141Z] transaction abort!
[vcs 2019-07-30T03:35:20.158Z] failed to truncate 00changelog.i
[vcs 2019-07-30T03:35:20.158Z] rollback failed - please run hg recover
[vcs 2019-07-30T03:35:21.451Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.0596899986267}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall_clone", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall_clone_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}]}
[vcs 2019-07-30T03:35:21.451Z] abort: No space left on device: /builds/worker/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg/store/data/dom/ipc/_content_parent.cpp.d
[taskcluster 2019-07-30 03:35:21.989Z] === Task Finished ===
[taskcluster 2019-07-30 03:35:22.159Z] Artifact "public/build/logs" not found at "/builds/worker/workspace/build/logs"
[taskcluster 2019-07-30 03:35:22.787Z] Unsuccessful task run with exit code: 255 completed in 408.334 seconds
[task 2019-07-30T04:10:23.144Z] 04:10:23 INFO - Running post-action listener: setup_coverage_tools
[task 2019-07-30T04:10:23.144Z] 04:10:23 INFO - [mozharness: 2019-07-30 04:10:23.144091Z] Finished download-and-extract step (failed)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - Uncaught exception: Traceback (most recent call last):
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2097, in run
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - self.run_action(action)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2036, in run_action
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 1991, in _possibly_run_method
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - return getattr(self, method_name)()
[task 2019-07-30T04:10:23.148Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/scripts/web_platform_tests.py", line 320, in download_and_extract
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - suite_categories=["web-platform"])
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 472, in download_and_extract
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - self._download_test_packages(suite_categories, extract_dirs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 373, in _download_test_packages
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - extract_dirs=unpack_dirs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 738, in download_unpack
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - function(**kwargs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 645, in deflate
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - t.extractall(path=extract_to)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2079, in extractall
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self.extract(tarinfo, path)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2116, in extract
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2192, in _extract_member
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self.makefile(tarinfo, targetpath)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2232, in makefile
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - with bltn_open(targetpath, "wb") as target:
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - IOError: [Errno 28] No space left on device: '/builds/worker/workspace/build/tests/web-platform/tests/orientation-event/t028-manual.https.html'
[task 2019-07-30T04:10:23.151Z] 04:10:23 FATAL - Running post_fatal callback...
[task 2019-07-30T04:10:23.151Z] 04:10:23 FATAL - Exiting -1
[task 2019-07-30T04:10:23.151Z] 04:10:23 INFO - Running post-run listener: _resource_record_post_run
[task 2019-07-30T04:10:23.244Z] cleanup
[task 2019-07-30T04:10:23.245Z] + cleanup
[task 2019-07-30T04:10:23.245Z] + local rv=255
[task 2019-07-30T04:10:23.245Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2019-07-30T04:10:23.245Z] + cp /builds/worker/.xsession-errors /builds/worker/artifacts/public/xsession-errors.log
[task 2019-07-30T04:10:23.251Z] cp: cannot create regular file '/builds/worker/artifacts/public/xsession-errors.log': No space left on device
[taskcluster 2019-07-30 04:10:25.618Z] === Task Finished ===
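For illustration only: the traceback above fails partway through `t.extractall(path=extract_to)`, after much of the archive has already been written. A pre-flight check along the following lines could fail earlier with a clearer message. This is a minimal sketch, not mozharness code; `required_bytes`, `extract_with_space_check` and the `headroom` parameter are hypothetical names, and the sketch assumes the extraction directory already exists and that the archive comes from a trusted source.

```python
import errno
import shutil
import tarfile


def required_bytes(archive_path):
    """Sum the uncompressed sizes of the regular files in a tar archive."""
    with tarfile.open(archive_path) as archive:
        return sum(m.size for m in archive.getmembers() if m.isfile())


def extract_with_space_check(archive_path, extract_to, headroom=0.10):
    """Extract only if the target filesystem has room for the contents.

    `headroom` (hypothetical knob) adds a safety margin on top of the
    summed file sizes, since directories and block rounding also use space.
    """
    needed = int(required_bytes(archive_path) * (1 + headroom))
    free = shutil.disk_usage(extract_to).free
    if needed > free:
        # Raise the same errno the log shows, but before writing anything.
        raise OSError(
            errno.ENOSPC,
            "need %d bytes, only %d free" % (needed, free),
            extract_to,
        )
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=extract_to)
```

Such a check is only advisory: with several tasks sharing one worker, another task can still consume the space between the check and the extraction.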
Reporter
Comment 1•5 years ago
Of the 16 failures seen in about an hour so far (https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-07-23&endday=2019-07-30&tree=trunk&bug=1569856), most happen on these workers:
https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux/workers/packet-sjc1/machine-12
https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux/workers/packet-sjc1/machine-10
https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux/workers/packet-sjc1/machine-35
Comment 3•5 years ago
I'll punt to gbrown, who knows everything there is about the Android emulators. I expect we just need to increase their sizes.
Comment 4•5 years ago
Machines responsible for this have been quarantined:
grenade> andrei_ciure|sheriffduty, Aryx: I have quarantined 1, 10, 12 & 35.
Hi Wander, can you also take a look at this issue?
Comment 5•5 years ago
At first glance I suspect this is a problem with the state of the packet.net machines. Last night there was a change to the wrench tasks that caused download paths to change; I wonder if those problems left behind large files in unexpected places. :wcosta or :coop might be better equipped to resolve this. I will investigate more...
Comment 6•5 years ago
In try pushes I am seeing 50% to 80% disk usage (the Use% column in df), and I can't find any extra files. I suppose some variation in available disk space is expected from the task's perspective, since we run up to 4 tasks per worker on packet.net? In that case, I'm not seeing anything wrong, but I'm not looking at the quarantined machines.
Probably the next step is for someone to look at the quarantined machines (1, 10, 12 & 35) and see if they can be cleaned up and brought back into service. I don't know how to do that. Let's needinfo (ni) a couple of other people who might know...
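For anyone wanting to repeat the df check from inside a task, a rough sketch follows. It is not part of any existing tooling; `percent_used`, `has_room` and the 80% threshold are assumptions chosen to match the range quoted above. The figure is used/total, so it can differ slightly from df's Use% column, which excludes blocks reserved for root.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def has_room(path, max_percent_used=80.0):
    """True if usage is at or below the (hypothetical) threshold."""
    return percent_used(path) <= max_percent_used


if __name__ == "__main__":
    # Report usage for the current working directory's filesystem.
    print("%.1f%% used, has_room=%s" % (percent_used("."), has_room(".")))
```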
Comment 7•5 years ago
machine-30 needs to be quarantined too: https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux/workers/packet-sjc1/machine-30
Comment 8•5 years ago
I don't have permissions to ssh to the hosts or quarantine.
I'll work with Wander to get the required permissions and figure out how to care for these.
Comment 9•5 years ago
The root of the problem is docker-worker volume caching:
root@machine-1:/mnt/var/cache/docker-worker# du -hs *
16K gecko-level-1-checkouts-v3-33ea6ead87f10b63cd64
16K gecko-level-1-checkouts-v3-382574ba03a201a3ed4a
28K gecko-level-1-checkouts-v3-694222febc6321e83215
43G gecko-level-1-checkouts-v3-8be03508dc6d71e4397d
627M gecko-level-1-tooltool-cache-v3-33ea6ead87f10b63cd64
731M gecko-level-1-tooltool-cache-v3-382574ba03a201a3ed4a
1.6G gecko-level-1-tooltool-cache-v3-694222febc6321e83215
9.5G gecko-level-1-tooltool-cache-v3-8be03508dc6d71e4397d
52K gecko-level-2-checkouts-v3-8be03508dc6d71e4397d
2.9G gecko-level-2-tooltool-cache-v3-8be03508dc6d71e4397d
28K gecko-level-3-checkouts-v3-33ea6ead87f10b63cd64
50G gecko-level-3-checkouts-v3-8be03508dc6d71e4397d
22G gecko-level-3-checkouts-v3-df476dba6f950ad72a52
1.3G gecko-level-3-tooltool-cache-v3-33ea6ead87f10b63cd64
7.0G gecko-level-3-tooltool-cache-v3-8be03508dc6d71e4397d
3.7G gecko-level-3-tooltool-cache-v3-df476dba6f950ad72a52
I am going to investigate this in the docker-worker code and take care of the quarantined machines.
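A scripted equivalent of the `du -hs *` listing above might look like the sketch below, which could help spot oversized caches across many workers. The function names `tree_size` and `cache_sizes` are hypothetical, and the cache root is whatever directory is passed in. The sketch sums apparent file sizes, so its totals will not exactly match du, which counts allocated disk blocks.

```python
import os


def tree_size(root):
    """Sum the apparent sizes of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # A file may disappear while a task is still running.
                continue
    return total


def cache_sizes(cache_root):
    """Return (directory name, bytes) pairs for each cache, largest first."""
    sizes = {}
    with os.scandir(cache_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = tree_size(entry.path)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Sorting largest first puts entries like the 43G and 50G checkouts caches at the top of the result.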
Comment 10•5 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> I am going to investigate this in the docker-worker code and take care of the quarantined machines.
Since there does seem to be an issue with docker-worker itself (or at least its caches -- that's a lot of cache), I think it's on Wander to fix this for this go-round.
We should absolutely get relops access to these machines at the end of that process so they can help manage these going forward.
Reporter
Comment 14•5 years ago
No failures here since the 1st of August.
Comment 15•5 years ago
There are still quite a few quarantined workers because of this; will those be recovered?
Comment 16•5 years ago
Sorry, I jumped the gun, didn't I?
Comment 17•5 years ago
Comment 18•5 years ago
Comment 24•4 years ago
(closed as part of mass closure of old intermittent bugs)
Comment 25•1 year ago
Reopening inactive bugs, because they may still need attention. Historically, inactive bugs were closed, but this hides the fact there are genuine issues which have not been resolved.