Open Bug 1569856 Opened 5 years ago Updated 1 year ago

Android 7.0 x86 IOError: [Errno 28] No space left on device:

Categories: Taskcluster :: Workers, defect
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: REOPENED
People: Reporter: CosminS, Unassigned
Keywords: intermittent-failure
Whiteboard: docker-worker
Attachments: 1 file, 1 obsolete file

Treeherder push: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception&tochange=32f944ce7a046d9a8ad1ccae0747b8a4e09c36a5&fromchange=00d1d637d22a8d9fe6f0f391a7cdee874f511e8d&searchStr=Android%2C7.0%2Cx86&selectedJob=258936798

Failure logs: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258936798&repo=autoland&lineNumber=669
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258931944&repo=autoland&lineNumber=338

[vcs 2019-07-30T03:35:20.141Z] files [======> ] 80118/525924 10m33s
[vcs 2019-07-30T03:35:20.141Z]
[vcs 2019-07-30T03:35:20.141Z] transaction abort!
[vcs 2019-07-30T03:35:20.158Z] failed to truncate 00changelog.i
[vcs 2019-07-30T03:35:20.158Z] rollback failed - please run hg recover
[vcs 2019-07-30T03:35:21.451Z] PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "clone_errored", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.0596899986267}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall_clone", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}, {"extraOptions": ["packet.net"], "lowerIsBetter": true, "name": "overall_clone_fullcheckout", "serverUrl": "hg.mozilla.org", "shouldAlert": false, "subtests": [], "value": 340.6039879322052}]}
[vcs 2019-07-30T03:35:21.451Z] abort: No space left on device: /builds/worker/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg/store/data/dom/ipc/_content_parent.cpp.d
[taskcluster 2019-07-30 03:35:21.989Z] === Task Finished ===
[taskcluster 2019-07-30 03:35:22.159Z] Artifact "public/build/logs" not found at "/builds/worker/workspace/build/logs"
[taskcluster 2019-07-30 03:35:22.787Z] Unsuccessful task run with exit code: 255 completed in 408.334 seconds

[task 2019-07-30T04:10:23.144Z] 04:10:23 INFO - Running post-action listener: setup_coverage_tools
[task 2019-07-30T04:10:23.144Z] 04:10:23 INFO - [mozharness: 2019-07-30 04:10:23.144091Z] Finished download-and-extract step (failed)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - Uncaught exception: Traceback (most recent call last):
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2097, in run
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - self.run_action(action)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 2036, in run_action
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 1991, in _possibly_run_method
[task 2019-07-30T04:10:23.147Z] 04:10:23 FATAL - return getattr(self, method_name)()
[task 2019-07-30T04:10:23.148Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/scripts/web_platform_tests.py", line 320, in download_and_extract
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - suite_categories=["web-platform"])
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 472, in download_and_extract
[task 2019-07-30T04:10:23.149Z] 04:10:23 FATAL - self._download_test_packages(suite_categories, extract_dirs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 373, in _download_test_packages
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - extract_dirs=unpack_dirs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 738, in download_unpack
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - function(**kwargs)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/builds/worker/workspace/mozharness/mozharness/base/script.py", line 645, in deflate
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - t.extractall(path=extract_to)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2079, in extractall
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self.extract(tarinfo, path)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2116, in extract
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2192, in _extract_member
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - self.makefile(tarinfo, targetpath)
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - File "/usr/lib/python2.7/tarfile.py", line 2232, in makefile
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - with bltn_open(targetpath, "wb") as target:
[task 2019-07-30T04:10:23.150Z] 04:10:23 FATAL - IOError: [Errno 28] No space left on device: '/builds/worker/workspace/build/tests/web-platform/tests/orientation-event/t028-manual.https.html'
[task 2019-07-30T04:10:23.151Z] 04:10:23 FATAL - Running post_fatal callback...
[task 2019-07-30T04:10:23.151Z] 04:10:23 FATAL - Exiting -1
[task 2019-07-30T04:10:23.151Z] 04:10:23 INFO - Running post-run listener: _resource_record_post_run
[task 2019-07-30T04:10:23.244Z] cleanup
[task 2019-07-30T04:10:23.245Z] + cleanup
[task 2019-07-30T04:10:23.245Z] + local rv=255
[task 2019-07-30T04:10:23.245Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2019-07-30T04:10:23.245Z] + cp /builds/worker/.xsession-errors /builds/worker/artifacts/public/xsession-errors.log
[task 2019-07-30T04:10:23.251Z] cp: cannot create regular file '/builds/worker/artifacts/public/xsession-errors.log': No space left on device
[taskcluster 2019-07-30 04:10:25.618Z] === Task Finished ===
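
Both logs above die partway through writing (the hg clone and the tarfile extraction) once the disk fills. As a rough illustration only, and not what mozharness or run-task actually does, a pre-flight check could compare the uncompressed size of the archive with the free space on the target before extracting. The function names and the 10% `headroom` margin below are hypothetical.

```python
import errno
import shutil
import tarfile


def is_no_space_error(exc):
    """Return True if exc is an ENOSPC error ("No space left on device")."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def required_bytes(archive_path):
    """Sum the uncompressed sizes of the regular files in a tar archive."""
    with tarfile.open(archive_path) as tar:
        return sum(m.size for m in tar.getmembers() if m.isfile())


def extract_with_space_check(archive_path, dest_dir, headroom=0.10):
    """Extract only if dest_dir has room for the archive plus a margin.

    headroom is an assumed safety margin, not a value taken from the bug.
    """
    needed = int(required_bytes(archive_path) * (1 + headroom))
    free = shutil.disk_usage(dest_dir).free
    if free < needed:
        # Fail early with the same errno the logs show, before any
        # partial extraction is left behind on disk.
        message = "need %d bytes, only %d free" % (needed, free)
        raise OSError(errno.ENOSPC, message, dest_dir)
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=dest_dir)
```

This would not fix the underlying cache growth; it only turns a mid-extraction failure into an early, clearly labelled one. With several tasks sharing a worker, free space can still change between the check and the extraction.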

Bob, can you please take a look at this?

Flags: needinfo?(bob)

I'll punt to gbrown, who knows everything there is to know about the Android emulators. I expect we just need to increase their sizes.

Flags: needinfo?(bob) → needinfo?(gbrown)

Machines responsible for this have been quarantined:

grenade> andrei_ciure|sheriffduty, Aryx: I have quarantined 1, 10, 12 & 35.

Hi Wander, can you also take a look at this issue?

Flags: needinfo?(wcosta)
Flags: needinfo?(gbrown)

At first glance I suspect this is a problem with the state of the packet.net machines. Last night there was a change to the wrench tasks that caused download paths to change; I wonder if those problems left behind large files in unexpected places. :wcosta or :coop might be better equipped to resolve this. I will investigate more...

Flags: needinfo?(gbrown)
See Also: 1509670, 1569817

In try pushes I am seeing 50% to 80% disk use (the Use% column in df), and I can't find any extra files. I suppose some variation in available disk is expected from the task's perspective, since we run up to 4 tasks per worker on packet.net? In that case, I'm not seeing anything wrong -- but I'm not looking at the quarantined machines.
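
For anyone who wants to record the same figure from inside a task, here is a minimal sketch that approximates the percentage df reports, using only the Python standard library. The helper names are made up for illustration, and the value may differ slightly from df because df rounds.

```python
import shutil


def use_percent(used, available):
    """Percentage of space in use, computed as used / (used + available)."""
    total = used + available
    if total == 0:
        return 0.0
    return 100.0 * used / total


def path_use_percent(path):
    """Approximate disk use (in percent) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free)
```

Logging something like `path_use_percent("/builds/worker")` at task start would show how much room the task really had, which matters when several tasks share one worker.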

Probably the next step is for someone to look at the quarantined machines (1, 10, 12 & 35) and see if they can be cleaned up and brought back into service. I don't know how to do that. Let's needinfo a couple of other people who might know...

Flags: needinfo?(coop)
Flags: needinfo?(aerickson)

I don't have permissions to ssh to the hosts or quarantine.

I'll work with Wander to get the required permissions and figure out how to care for these.

Flags: needinfo?(aerickson)

The root of the problem is docker-worker volume caching:

root@machine-1:/mnt/var/cache/docker-worker# du -hs *
16K gecko-level-1-checkouts-v3-33ea6ead87f10b63cd64
16K gecko-level-1-checkouts-v3-382574ba03a201a3ed4a
28K gecko-level-1-checkouts-v3-694222febc6321e83215
43G gecko-level-1-checkouts-v3-8be03508dc6d71e4397d
627M gecko-level-1-tooltool-cache-v3-33ea6ead87f10b63cd64
731M gecko-level-1-tooltool-cache-v3-382574ba03a201a3ed4a
1.6G gecko-level-1-tooltool-cache-v3-694222febc6321e83215
9.5G gecko-level-1-tooltool-cache-v3-8be03508dc6d71e4397d
52K gecko-level-2-checkouts-v3-8be03508dc6d71e4397d
2.9G gecko-level-2-tooltool-cache-v3-8be03508dc6d71e4397d
28K gecko-level-3-checkouts-v3-33ea6ead87f10b63cd64
50G gecko-level-3-checkouts-v3-8be03508dc6d71e4397d
22G gecko-level-3-checkouts-v3-df476dba6f950ad72a52
1.3G gecko-level-3-tooltool-cache-v3-33ea6ead87f10b63cd64
7.0G gecko-level-3-tooltool-cache-v3-8be03508dc6d71e4397d
3.7G gecko-level-3-tooltool-cache-v3-df476dba6f950ad72a52
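
In the listing above, the three largest checkout caches (43G, 50G and 22G) add up to roughly 115G. To compare machines more easily, a listing like this can be parsed and ranked. The following is a small sketch, assuming `du -hs` style output with 1024-based K/M/G/T suffixes; the function names and the grouping by "checkouts" versus "tooltool-cache" in the directory name are assumptions for illustration, not docker-worker behaviour.

```python
import re

# du -h prints sizes in powers of 1024.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)([KMGT]?)\s+(\S+)\s*$")


def parse_du(text):
    """Parse `du -hs` output into a list of (name, size_in_bytes) tuples."""
    entries = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # skip prompts, blank lines and anything unexpected
        number, unit, name = match.groups()
        entries.append((name, int(float(number) * _UNITS[unit])))
    return entries


def largest(entries, n=3):
    """Return the n biggest entries, largest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)[:n]


def total_by_kind(entries):
    """Total bytes per cache kind, inferred from the directory name."""
    totals = {}
    for name, size in entries:
        if "-checkouts-" in name:
            kind = "checkouts"
        elif "-tooltool-cache-" in name:
            kind = "tooltool-cache"
        else:
            kind = "other"
        totals[kind] = totals.get(kind, 0) + size
    return totals
```

Feeding it the output of `du -hs *` from each quarantined machine would show whether the checkout caches dominate everywhere or only on some hosts.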

I am going to investigate this in the docker-worker code and take care of the quarantined machines.

Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> I am going to investigate this in the docker-worker code and take care of the quarantined machines.

Since there does seem to be an issue with docker-worker itself (or at least its caches -- that's a lot of cache), I think it's on Wander to fix this for this go-round.

We should absolutely get relops access to these machines at the end of that process so they can help manage these going forward.

Flags: needinfo?(coop)
Component: Testing → General
Product: Firefox for Android → Taskcluster
Whiteboard: [stockwell disable-recommended]

No failures here since the 1st of August.

Whiteboard: [stockwell disable-recommended]
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

There are still quite a few quarantined workers because of this; will those be recovered?

Sorry, I jumped the gun, didn't I?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #9083783 - Attachment is obsolete: true
Regressions: 1573701
Whiteboard: [stockwell disable-recommended]
Component: General → Workers
Whiteboard: docker-worker

(closed as part of mass closure of old intermittent bugs)

Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → INACTIVE

Reopening inactive bugs, because they may still need attention. Historically, inactive bugs were closed, but this hides the fact there are genuine issues which have not been resolved.

Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---