Closed Bug 1479889 Opened 6 years ago Closed 4 years ago

Use fewer volumes on Windows workers

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: grenade)

Attachments

(1 file)

Following discussion in bug 1350956, we think our drive use strategy on Windows AWS workers is too complicated. Specifically, we're using 2 EBS mounted drives when we should only need 1.

Having a single non-AMI initialized drive on Windows for "scratch" data (caches and task home directories) would be consistent with how Linux workers work (at least with docker-worker).

FWIW, I suspect the reason behind using 2 drives is the historical presence of 2 ephemeral drives on 3rd generation and older EC2 instances. You used to get the root volume (initialized from an AMI) and 2 empty per-instance ephemeral volumes, and if both were available, you might as well make use of them. With the 4th generation instances (e.g. c4s and m4s) and later, these ephemeral volumes no longer exist, except on the c5d/m5d/z1d instances, which have attached ephemeral NVMe storage.

So per pmoore's request in bug 1350956, I'm filing a bug to track consolidating all worker behavior to use a single volume.

The worker /may/ want to support using the root volume for whatever we put on Y: and Z: today. But in EC2, we shouldn't be putting per-task data on C: because of EBS volume initialization overhead (bug 1305174).

the drive setup on windows tc workers is currently:
C: os
Y: cache
Z: task

it was indeed set up this way because of the older ephemeral drives. there was also a period of time when generic worker was not always cleaning up task directories completely, so the generic worker wrapper script would quick format the z: drive between tasks in order to properly clean up after them. we didn't want to blow away caches between tasks, so we opted to stick them on y:, where they could escape death by format.

i believe the problems with gw cleaning up task folders are mostly resolved. they popped up again recently on windows 10 hardware instances (bug 1433854), but those don't use a z: drive anyway, so i doubt that's still a blocker.

i'll have a go at removing the drive formatting that currently occurs on z: drives between tasks and moving caches to the z: drive. if that goes smoothly we can see what we can gain by putting everything on c:
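
The ordering in this plan (stop formatting z: first, move caches second) can be illustrated with a small model. This is only a sketch under stated assumptions: the Drive class, run_task and remove_task_dirs are hypothetical names invented for illustration, and none of this is taken from OpenCloudConfig or generic-worker. It simulates a drive as a dictionary and shows that caches survive a between-task format only when they live on a separate drive, or when the format is replaced by removal of the task directories alone.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical stand-in for a worker volume (not real OCC code)."""
    letter: str
    contents: dict = field(default_factory=dict)  # purpose -> stored items

    def quick_format(self) -> None:
        """Simulate a quick format: everything on the drive is lost."""
        self.contents = {}


def run_task(task_drive: Drive, cache_drive: Drive, task_id: str) -> None:
    """Simulate a task writing its home directory and populating a cache."""
    task_drive.contents.setdefault("task", []).append(task_id)
    cache_drive.contents.setdefault("cache", []).append(f"cache-from-{task_id}")


def remove_task_dirs(drive: Drive) -> None:
    """Simulate the worker deleting only task directories (assumed behavior)."""
    drive.contents.pop("task", None)


# Layout described above: caches on Y:, tasks on Z:, Z: formatted between tasks.
y, z = Drive("Y"), Drive("Z")
run_task(task_drive=z, cache_drive=y, task_id="task-1")
z.quick_format()
cache_survives_separate = "cache" in y.contents

# Single drive, format still in place: the cache is wiped along with the task.
single = Drive("Z")
run_task(task_drive=single, cache_drive=single, task_id="task-2")
single.quick_format()
cache_survives_formatted = "cache" in single.contents

# Single drive, format replaced by task-directory cleanup: the cache is kept.
single2 = Drive("Z")
run_task(task_drive=single2, cache_drive=single2, task_id="task-3")
remove_task_dirs(single2)
cache_survives_cleanup = "cache" in single2.contents

print(cache_survives_separate, cache_survives_formatted, cache_survives_cleanup)
```

In this model, moving caches onto z: is only safe once the between-task format is gone, which is why the format removal comes first.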
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
i will look at this after bug 1433854 is resolved. we can't remove the cleanup workarounds that occ is doing until gw is correctly removing task folders.
Depends on: 1433854
Attached file GitHub Pull Request
this pr removes drive formatting of the z: drive between generic worker task runs. this is done in preparation for reducing the number of volumes since we don't want to format drives containing caches or other required data.

also, the gw wrapper script contained some lines relating to a deprecated loaner mechanism that have been redundant for some time and do nothing. i've removed them to reduce the unnecessary complexity of this script.
Attachment #9015787 - Flags: review?(mcornmesser)
Attachment #9015787 - Flags: review?(mcornmesser) → review+
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED