Open Bug 1526311 Opened 5 years ago Updated 9 months ago

Manage disk space on caches partition, if different to partition for task directories

Categories

(Taskcluster :: Workers, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

REOPENED

People

(Reporter: pmoore, Unassigned)

References

(Blocks 1 open bug)

Details

We currently check at the start of each task that enough disk space is available on the partition that holds the task directories. However, if caches are stored on a different partition from the task directories, two problems arise. First, the cleanup strategy for the task directories partition won't work, since it frees up space by removing caches, and those live on the other partition. Second, if the caches partition fills up, we don't have a mechanism to free up space there.

This is the probable cause of bug 1519472.

We should ensure that both partitions have adequate disk space before each task, and only delete content on the partition that we are trying to free up space on.

Note, it is potentially suboptimal to put caches on a different partition to task directories, since caches then have to be moved with copy/delete semantics, which can be a very expensive operation e.g. when there are many files in the cache.

By contrast, mounting a cache from the same partition is essentially a rename: it involves updating a single inode, and no file data is copied.

Originally, on Gecko Windows workers, caches were put on the Y: drive and task directories on the Z: drive, because OpenCloudConfig formatted the Z: drive on reboot, as generic-worker was not always able to delete previous task directories.

If that is no longer the case, the format-on-boot strategy could be revised. I have a nagging feeling that there might be an open issue whereby some tasks create files/folders and then deny all other users delete permission, so that the worker does not have permission to delete them. I suspect we might need to take ownership of the files/folders and their subdirectories before deleting them (or, as a slightly more optimal solution for the common case, take ownership only if the initial deletion attempt fails, and then attempt the deletion again).

I'll try to find some time in the next week to look into options...

Component: Generic-Worker → Workers

(In reply to Pete Moore [:pmoore][:pete] from comment #1)

Note, it is potentially suboptimal to put caches on a different partition to task directories, since caches then have to be moved with copy/delete semantics, which can be a very expensive operation e.g. when there are many files in the cache.

I filed bug 1527313 to track enabling this cache on Windows builds (bug 1519472 landed by disabling caching on the affected tasks). If moving the checkout back to Z: is more optimal, I think that's a valid way to fix this particular issue.

Though in that case it would be good to emit a clearer error message if we try to mount a cache from a different partition. Maybe we could just ban doing that altogether.

Assignee: nobody → pmoore
No longer blocks: 1519472
No longer blocks: 1527313

We should still do this, but I don't think I'm going to get to it soon, so unassigning myself.

Assignee: pmoore → nobody

Note, this is now goal G9 of the Worker Roadmap 2020.

Not actively working on this at the moment.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE

Reopening inactive bugs, because they may still need attention. Historically, inactive bugs were closed, but this hides the fact there are genuine issues which have not been resolved.

Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---