Closed Bug 1378381 Opened 7 years ago Closed 7 years ago

OpenCloudConfig: avoid long-running format of EBS backed Z: drive

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: grenade)

On Windows workers managed by OpenCloudConfig, we currently format the Z: drive[1] between task runs in order to improve the performance of reads from the Z: drive (because of the copy-on-read semantics).

The format can take 20-25 minutes, so we should avoid it. Instead, we should probably create a fresh volume at instance startup. The two possible volume types are an EBS volume (remote) or an instance store volume (local). An instance store volume might not be available on all instance types, so we would need to check whether there are instance types that meet all our requirements (including pricing!) in every case. I'm not sure at the moment whether dynamically creating an EBS volume from scratch removes the need to format the drive for the performance gain; we would need to test this. It looks like an EBS volume can be created via PowerShell[2].

AWS provides comprehensive documentation about block device mapping configuration[3].


--

[1] https://github.com/mozilla-releng/OpenCloudConfig/blob/9e615f9b56026faca9307f8dc582097f101a6d67/userdata/Configuration/GenericWorker/run-generic-worker-format-and-reboot.bat#L37

[2] http://docs.aws.amazon.com/powershell/latest/reference/items/New-EC2Volume.html

[3] http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/block-device-mapping-concepts.html
Instance store volumes are going the way of the dodo and don't exist on modern EC2 instance types. Everything is backed by EBS, right?

Either way, yes, attaching a fresh EBS volume and doing a quick format is the way to go. If you initialize an EBS volume from an AMI, you get the crappy copy-on-read behavior. It is probably faster to initialize a fresh EBS volume and stream bits from S3 than to initialize from an AMI and touch all sectors via format.
Blocks: 1305174
Blocks: 1381768
Assignee: relops → rthijssen
just an update that the implementation that mounts fresh ebs volumes (on spot instances) at boot is working. i'm testing on gecko-1-b-win2012-beta.

basic design is:
- ami contains only the c: drive which has dependencies installed through occ during golden ami run
- occ updates the provisioner config with a section like this:

      "launchSpec": {
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/sda1",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 40
            }
          },
          {
            "DeviceName": "/dev/sdb",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 120
            }
          }
        ],
        ...

- occ/dsc runs again on the spot instance and initialises /dev/sdb with two partitions for y: and z:, quick formats these
  and assigns drive letters (https://github.com/mozilla-releng/OpenCloudConfig/blob/346047b7/userdata/rundsc.ps1#L294-L355)
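
A minimal Python sketch of the two steps above, for illustration only. The first function rebuilds the BlockDeviceMappings fragment shown in this comment; the second emits a diskpart script that splits the second disk into two quick-formatted partitions. diskpart is a stand-in here: the real work is done in the linked rundsc.ps1, and the function names, the disk number and the partition size are all hypothetical.

```python
import json


def build_block_device_mappings(root_gb=40, data_gb=120):
    """Return the BlockDeviceMappings list shown in the comment above.

    Neither entry names a SnapshotId, so /dev/sdb should come up as a blank
    volume instead of being restored from the AMI (assumption based on how
    EC2 block device mappings normally behave).
    """
    def ebs(device_name, size_gb):
        return {
            "DeviceName": device_name,
            "Ebs": {"DeleteOnTermination": True, "VolumeSize": size_gb},
        }

    return [ebs("/dev/sda1", root_gb), ebs("/dev/sdb", data_gb)]


def diskpart_script(disk_number, y_size_mb):
    """Build a diskpart script: two partitions, quick-formatted as Y: and Z:.

    disk_number and y_size_mb are hypothetical inputs; the values actually
    used by OCC are not stated in this bug.
    """
    return "\n".join([
        f"select disk {disk_number}",
        f"create partition primary size={y_size_mb}",
        "format fs=ntfs quick",   # quick format: no full pass over the volume
        "assign letter=Y",
        "create partition primary",  # no size given: use the remaining space
        "format fs=ntfs quick",
        "assign letter=Z",
    ])


if __name__ == "__main__":
    spec = {"launchSpec": {"BlockDeviceMappings": build_block_device_mappings()}}
    print(json.dumps(spec, indent=2))
    print(diskpart_script(disk_number=1, y_size_mb=20480))
```

The generated text could be saved to a file and passed to `diskpart /s <file>` on a test instance. Depending on the instance's disk policy, the new disk may need to be brought online before it can be partitioned.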

i ran into problems with today's testing, due to a missing cot gpg key for gecko-1-b-win2012-beta which meant that test builds failed (https://treeherder.mozilla.org/#/jobs?repo=try&revision=59e97080262b13037129a420613c3d0d229da018&group_state=expanded&exclusion_profile=false&filter-searchStr=tc). but i expect to have this resolved and deployed tomorrow to gecko-(1-3)-b-win2012.
(In reply to Amy Rich [:arr] [:arich] from comment #3)
> This got landed in try today in
> https://github.com/mozilla-releng/OpenCloudConfig/commit/c819210a76021161285fb782a708316fd8e2807e
> but signs point to it causing this problem:
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=074acb44df33719de2e47ee941887fc8da6e61e4&selectedJob=123961923
After conversation in irc with arr and gps. 
https://github.com/mozilla-releng/OpenCloudConfig/commit/44f6633a88caefb776c294e86a325d8e3b6f6554
Rob mentioned today that he was testing an updated patch in try again.
Flags: needinfo?(rthijssen)
yes. has been running for nearly 24 hours on gecko-1-b-win2012 without any hg path length exceptions. probably due to the robustcheckout updates and the removal of the hg precache as well.
https://github.com/mozilla-releng/OpenCloudConfig/commit/c087f802347464ee3084a66ee8c7590bf852ec15

promoting now to gecko-2-b-win2012 & gecko-3-b-win2012
https://github.com/mozilla-releng/OpenCloudConfig/commit/bfaef41031359709dfde980669f3aa0dbb193410
Flags: needinfo?(rthijssen)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Thanks Rob!
rob, do you know if this could affect the testers as well?  I believe they do formats too, but the patch is for changing the builders.
Flags: needinfo?(rthijssen)
yes, the patch went live for win 7 and 10 this afternoon as well.
Flags: needinfo?(rthijssen)