Closed Bug 1307798 Opened 8 years ago Closed 8 years ago

Windows instances in AWS should not be doing work on c:\

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Assigned: markco)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [Windows])

Attachments

(3 files)

Currently, tests executing on Windows AWS testers are working out of c:\. Unfortunately, c:\ on AWS is horrendously slow due to bug 1305174. I/O on c:\ is likely horrible. This almost certainly slows down tests, which makes end-to-end times slower and wastes money because we need more instances to do the same amount of work.

Establishing a fresh EBS volume (or using ephemeral volumes) and having tests run from there should make things faster.

catlee: please triage this request
Flags: needinfo?(catlee)
This has a very high impact in our end to end times.
This is especially noticeable when doing artifact builds on try, where cloning and updating takes longer than the build process.

We should also make similar changes for test machines for when we look into running them from the source checkout.
This is already the case for web-platform-tests on Linux/TC.
I believe that Mark is working on the builders at the moment in bug 1305174.
Assignee: nobody → relops
Component: General Automation → RelOps
Flags: needinfo?(catlee)
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → arich
Assignee: relops → mcornmesser
What I am trying to do is mount a drive directly to the directory or directories in which the work is being done.

Do we have a list of specific directories in which work is being done? "C:\builds\moz2_slave" and "C:\builds\hg-shared"?

Alternatively, does anyone know if we can move the keys and tokens in C:\builds to another location? If that is a possibility, we can map the drive directly to C:\builds. If not, then the drive will need to be mapped to individual directories. The reason is that the drive mapping must be done to empty directories.
NI catlee with the hope that he can point you at someone in releng who can answer your questions.
Flags: needinfo?(catlee)
A hacky option is to create an NTFS junction that maps e.g. c:\builds to an EBS mounted drive. In theory this will slow down access to that path by having to resolve the junction. But that should be much faster than using an EBS mounted drive.
I think Joel is probably much more familiar with which paths are used by tests. I think most (all?) of the path references are stored in mozharness configs now.

Can we move keys and tokens that are currently in C:\builds to the EBS volume? Or to another temporary location on C: and then copy them into C:\builds once we've mounted the EBS volume?
Flags: needinfo?(catlee)
Update: I now have a function to be added into Ec2UserdataUtils.psm1 that will be called by the userscript. The function will mount the ephemeral drives to c:\builds for 2008 and C:\slave\test\build for Windows 7.

On 2008 with the ephemeral drives mounted and performing an hg clone I am seeing a transfer rate between 7.5 and 8 MB/sec. I have not tested on Windows 7 as of yet. 

At this point I need to add logic to move the contents of those directories out pre-mount and back in post-mount. I also need to figure out the other cloud tools pieces.
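
The move-out, mount, move-back sequence described above can be sketched as follows. This is only an illustration, written in Python rather than the PowerShell used in Ec2UserdataUtils.psm1. The names `remount_with_contents` and `mount` are hypothetical, and the actual drive-mounting step is left to the caller.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def remount_with_contents(mount_path: Path, mount: Callable[[Path], None]) -> None:
    """Empty mount_path, call mount(mount_path), then restore the contents.

    `mount` is a hypothetical callback standing in for whatever actually
    attaches the ephemeral drive; it expects mount_path to be empty.
    """
    with tempfile.TemporaryDirectory() as stash_name:
        stash = Path(stash_name)
        # Move everything (keys, tokens, and so on) out of the mount point.
        for entry in list(mount_path.iterdir()):
            shutil.move(str(entry), str(stash / entry.name))
        # The drive can only be mapped onto an empty directory.
        if any(mount_path.iterdir()):
            raise RuntimeError(f"{mount_path} is not empty; refusing to mount")
        mount(mount_path)
        # Move the stashed entries back onto the freshly mounted volume.
        for entry in list(stash.iterdir()):
            shutil.move(str(entry), str(mount_path / entry.name))
```

The stash directory here is a throwaway temporary directory; on a real instance it would need to live on a volume other than the one being mounted.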
Whiteboard: [Windows]
Summary: Windows testers in AWS should not be doing work on c:\ → Windows instances in AWS should not be doing work on c:\
Update:

I have a userdata change that will map c:\builds for 2008 and c:\slave\test\build for Windows 7 to the ephemeral drives (instance share 1 and 2) for c3.2xlarge instances.

For Windows 7 there is a bit of an issue with the existing ram disk: the instance sees it as disk 1, which conflicts with one of the ephemeral drives. This is a particular issue on g2 instances since there is only one ephemeral drive.

For 2008 it is generally functioning. The only possible issue is that on shutdown, as opposed to reboot, the C:\builds directory becomes unreachable. This might only be an issue for loaned machines, since during the life of a production instance it is rebooted and not shut down. So we may just need to have loaned instances use a different type to avoid this.
Grenade: this is what I have thus far for the prep spot function: 

function Prep-Spot {
  param (
    [switch] $force,
    [string] $tempdir = ('{0}\builds' -f $env:ProgramData)
  )
  begin {
    Write-Log -message ("{0} :: Function started" -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
  }
  process {
    # The ephemeral disks are expected to show up as D:
    $empdisk = "D:"
    $diskexist = Test-Path $empdisk
    if ($diskexist -eq $False) {
      Write-Log -message ("{0} :: Ephemeral disks are not available " -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
    #if ($false) {
      #Write-Log -message ("{0} :: detected prior run. skipping spot setup" -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
    } else {
      if ($env:ComputerName.Contains('w732')) {
        Write-Host "Windows 7 detected. Mounting ephemeral drives to C:\slave\test\build"
        Write-Log -message "Mounting ephemeral drives to C:\slave\test\build"
        $mountPath = ('{0}\slave\test\build' -f $env:SystemDrive)
        # Empty the mount point first; the drive can only be mounted to an empty directory
        Get-ChildItem -Path $mountPath | Move-Item -destination $tempdir
        if ((Test-Path -Path $mountPath -PathType Container -ErrorAction SilentlyContinue) -and ((Get-ChildItem $mountPath | Measure-Object).Count -eq 0)) {
          Mount-EphemeralDisks
          # Copy the original contents back onto the mounted drive
          Get-ChildItem -Path $tempdir | Copy-Item -destination $mountPath
        }
      }
      Write-Host $env:ComputerName
      if ($env:ComputerName.Contains('2008')) {
        Write-Host "Windows 2008 detected. Mounting ephemeral drives to C:\builds"
        Write-Log -message "Mounting ephemeral drives to C:\builds"
        $mountPath = ('{0}\builds' -f $env:SystemDrive)
        Get-ChildItem -Path $mountPath | Move-Item -destination $tempdir
        if ((Test-Path -Path $mountPath -PathType Container -ErrorAction SilentlyContinue) -and ((Get-ChildItem $mountPath | Measure-Object).Count -eq 0)) {
          Mount-EphemeralDisks
          Get-ChildItem -Path $tempdir | Copy-Item -destination $mountPath
        }
      }
    }
  }
}


With the:
    $empdisk = "D:"
    $diskexist = Test-Path $empdisk
    If ($diskexist -eq $False) {

I am trying to check for the ephemeral disks before moving further down the function. Do you have any suggestions to get around checking directly for the D: drive? Also, do you have any other suggestions?
Flags: needinfo?(rthijssen)
on win 7 with powershell < v3, options are limited. i had similar problems before updating to newer ps. newer ps has cmdlets for volume management like Get-Volume, which doesn't need the volume to be mounted in order to work.

you can use things like: `Get-WmiObject -Class win32_diskdrive` which will work and is only slightly messier.
Flags: needinfo?(rthijssen)
Catlee: I am going to need some direction from releng on this because any changes here will need to consider price and instance availability.  Below are times from various instance types with various drives mounted performing a clone of https://hg.mozilla.org/mozilla-central/. Where noted I tested without a paging file.

It seems that c3.2xlarge and c3.4xlarge with ephemeral disks have a significant performance improvement, and a greater improvement without the paging file. My concerns are instance availability for c3.2xlarge and instance price for c3.4xlarge. I am also not sure whether the lack of a paging file would cause an issue elsewhere in the build.

For the c4.2xlarge, the only significant improvement was with the general purpose SSD mounted and no paging file. Other c4.2xlarge configurations without the paging file experience a severe hit to performance.

Overall I am not sure how we would want to move forward on this.

Type c4.2xlarge 
No additional disks 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)
finished applying clone bundle


Type c4.2xlarge 
General Purpose SSD 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 1301.9 seconds (1.44 MB/sec)
finished applying clone bundle


Type c4.2xlarge 
General Purpose SSD With no paging file 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 676.0 seconds (2.77 MB/sec)
finished applying clone bundle



Type c4.2xlarge
Provisioned IOPS SSD 

transferred 1.83 GB in 1676.4 seconds (1.12 MB/sec)
finished applying clone bundle
searching for changes



c3.2xlarge 
Instance share (Ephemeral) 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 491.2 seconds (3.81 MB/sec)
finished applying clone bundle


c3.2xlarge 
Instance share (Ephemeral) with no paging file 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 317.4 seconds (5.90 MB/sec)
finished applying clone bundle


c3.4xlarge 
Instance share (Ephemeral) 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 616.3 seconds (3.04 MB/sec)
finished applying clone bundle


c3.4xlarge 
Instance share (Ephemeral) with no paging file 

269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 342.9 seconds (5.46 MB/sec)
finished applying clone bundle 

Each clone was performed on a fresh instance based off of spot-b-2008-2016-11-15-10-44 (ami-673b1370).
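
To compare the runs above on equal terms, the `transferred ...` summary lines can be parsed and the throughput recomputed. The sketch below is an illustration only; the helper names `parse_transfer` and `rate_mb_per_sec` are made up for this example, and the 1 GB = 1024 MB conversion is an assumption that happens to reproduce the rates hg printed.

```python
import re

# Matches Mercurial's clone summary line, e.g.
# "transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)"
TRANSFER_RE = re.compile(
    r"transferred (?P<size>[\d.]+) GB in (?P<secs>[\d.]+) seconds"
)


def parse_transfer(line: str) -> tuple[float, float]:
    """Return (size_in_gb, seconds) from an hg 'transferred ...' line."""
    match = TRANSFER_RE.search(line)
    if match is None:
        raise ValueError(f"not a transfer summary: {line!r}")
    return float(match["size"]), float(match["secs"])


def rate_mb_per_sec(size_gb: float, seconds: float) -> float:
    """Throughput in MB/sec, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / seconds


# Summary lines copied from the results above.
runs = {
    "c4.2xlarge, no additional disks":
        "transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)",
    "c4.2xlarge, GP SSD, no paging file":
        "transferred 1.83 GB in 676.0 seconds (2.77 MB/sec)",
    "c3.2xlarge, ephemeral, no paging file":
        "transferred 1.83 GB in 317.4 seconds (5.90 MB/sec)",
    "c3.4xlarge, ephemeral, no paging file":
        "transferred 1.83 GB in 342.9 seconds (5.46 MB/sec)",
}

baseline_secs = parse_transfer(runs["c4.2xlarge, no additional disks"])[1]
for label, line in runs.items():
    size_gb, secs = parse_transfer(line)
    print(f"{label}: {rate_mb_per_sec(size_gb, secs):.2f} MB/sec, "
          f"{baseline_secs / secs:.2f}x vs baseline")
```

Since every run transfers the same 1.83 GB, the speedup relative to the baseline is simply the ratio of the elapsed times.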
Flags: needinfo?(catlee)
:markco: I don't see any numbers with c4.4xlarge (which is what taskcluster has moved to). Did you test there?
Flags: needinfo?(mcornmesser)
I will test on c4.4xlarge tomorrow. I performed one run there, which I did not record, without an additional drive mounted, and the performance was lackluster. I will do some further testing with mounted drives and without the paging file.

We may not be able to get the same performance as taskcluster on the buildbot builders because of the old 32 bit version of python that is being used.
Flags: needinfo?(mcornmesser)
Something is still wrong with I/O if it is only doing 1-6MB/s. https://public-artifacts.taskcluster.net/BoCy9jVqQAChL4qBXj-s1w/0/public/logs/live_backing.log is a TC job on a c4.4xlarge doing ~2x that:

ensuring https://hg.mozilla.org/integration/autoland/@9f88b41193c6bf9730f75170ecb4cdb932450738 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
applying clone bundle from https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-unified/7f782f9a462dca67d483c5454658d4f0c89a2a51.packed1-gd.hg
270073 files to transfer, 1.54 GB of data
transferred 1.54 GB in 138.4 seconds (11.4 MB/sec)
Testing on c4.4xlarge with a mounted ssd and no paging file:

270048 files to transfer, 1.83 GB of data
transferred 1.83 GB in 506.5 seconds (3.70 MB/sec)
finished applying clone bundle

Better than current but still not close to taskcluster. I will need to go back through what has been done for the configuration for taskcluster to try to figure out the reason for the difference.
mark, one thing to check is what process priority the hg process is running under (use process explorer/monitor). if it is running as normal, above normal or high, ignore this suggestion. if it is running as below normal, you can set the following registry keys to assign it a higher priority (above normal cpu, normal io):

Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\hg.exe\PerfOptions
Name: CpuPriorityClass
Value: 0x00000006 (dword)

Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\hg.exe\PerfOptions
Name: IoPriority
Value: 0x00000002 (dword)

Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\python.exe\PerfOptions
Name: CpuPriorityClass
Value: 0x00000006 (dword)

Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\python.exe\PerfOptions
Name: IoPriority
Value: 0x00000002 (dword)

the other thing that made a positive difference on taskcluster was using a separate EBS drive for the build (not C: and not ephemeral, just a large (80gb) bog standard EBS SSD). you will notice when you select the second drive in EC2 console and play with the size of it, the IOPS get larger as the drive gets larger.

on tc win builds:
- the second ebs drive gave us a 30% performance boost on all worker types (win 7, 10, 2012).
- the hg priority registry hack gave a 600% performance boost on win 7 and a barely noticeable boost everywhere else.
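
One way to apply the PerfOptions values listed above to several executables at once is to generate a .reg file. The following is a hedged sketch, not what was used on the TaskCluster workers: `render_reg` is a hypothetical helper, and importing the resulting text (for example with `reg import`) has not been tested here.

```python
# Hypothetical helper: renders the PerfOptions values from the comment
# above as .reg file text. Nothing here touches the registry itself.
IFEO = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
        r"\CurrentVersion\Image File Execution Options")

PERF_OPTIONS = {
    "CpuPriorityClass": 0x00000006,  # "above normal cpu" per the comment
    "IoPriority": 0x00000002,        # "normal io" per the comment
}


def render_reg(executables: list[str]) -> str:
    """Return .reg text setting PERF_OPTIONS for each executable name."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for exe in executables:
        lines.append(f"[{IFEO}\\{exe}\\PerfOptions]")
        for name, value in PERF_OPTIONS.items():
            # DWORD values are written as eight hex digits.
            lines.append(f'"{name}"=dword:{value:08x}')
        lines.append("")
    return "\n".join(lines)


print(render_reg(["hg.exe", "python.exe"]))
```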
Update for Win 7: In order to use the ephemeral disks, the ram disk will need to be removed and remounted. There is an odd situation where, at some level, the instance is labeling the ram disk as disk 1. When the ephemeral drive is mounted, the disk manager also sees it as disk 1, so there is a bit of a conflict there.
For 2008 with the configuration mentioned in Comment 16, the clone times are better, but still significantly slower than TaskCluster. The transfer speed was between 3.0 and 4.0 MB/sec.

Grenade: For TC Win 7, what kind of transfer rate have you seen?
Flags: needinfo?(rthijssen)
we don't have any reliable data for win7 because we don't have any tasks that perform hg checkouts on that os. here's one i ran just now:

- win 7:
  https://g2mtvxyaaaavrla3wavuibgyowmjngyfo42ecqzxsxp4s7nl.taskcluster-worker.net:60023/log/CBaKTjMqQO-BbC_zq3hD_w
  270409 files to transfer, 1.54 GB of data
  transferred 1.54 GB in 622.8 seconds (2.54 MB/sec)

i was curious about the other os'es so tested them too:

- win 10:
  https://g3l3omaaaaavrlbj2gll7xvvshqcwx4spienv3na72zd4cy6.taskcluster-worker.net:60023/log/FYtKr9VRRnK81l4jnk8P4g
  270409 files to transfer, 1.54 GB of data
  transferred 1.54 GB in 142.1 seconds (11.1 MB/sec)

- win 2012:
  https://g2kvn4aaaaavrlblxuycv3hracegh56g4bcf667giknwl636.taskcluster-worker.net:60023/log/WCWjZ_wGR9KEKL9x65nSyw
  270409 files to transfer, 1.54 GB of data
  transferred 1.54 GB in 138.0 seconds (11.5 MB/sec)

win 7 is slow. before we gave the hg process 'above normal' priority, it was even slower (the same checkout would time out after 3 hours. i never ran one to completion)
Flags: needinfo?(rthijssen)
The registry settings from comment 19 have very little if any impact on 2008. 

Thus far the best results I have are:

c3.2xlarge with ephemeral drive mounted and no paging file with 6+ MB/sec 
c3.4xlarge with ephemeral drive mounted and no paging file with +/- 6 MB/sec
c4.4xlarge with paging file set to 46078 (the size recommended by the OS) with +/- 4 MB/sec

These are better than the current performance. The difference between this and TC could possibly be due to the difference in the age of the OSes. I am going to spend a bit more time trying to get the performance improved.

However, I will still need direction from releng on which instance type(s) to use moving forward.
grenade: Is there a location you can point me to so I can take a look at the mercurial.ini file you are using on TaskCluster instances?
Flags: needinfo?(rthijssen)
<@catlee> markco: what do you need to know in https://bugzilla.mozilla.org/show_bug.cgi?id=1307798
<markco> Catlee: Which instance type we want to use for the 2008 instances? C3.2xlarge or c3.4xlarge with ephemeral drives mounted or c4.4 xlarge with an ssd drive mounted? Much better performance with one of the first two.
<@catlee> what are we using right now?
<markco> c4.2xlarge
<@catlee> c3.4xlarge sounds good
<markco> catlee: rgr I will move forward with that
Flags: needinfo?(catlee)
This is what I have for the function. 

I am not going to be able to put a disk check into the function because of the lack of functionality in PowerShell on 2008 and because the ephemeral drives on the c3.4xlarge are not formatted when they are mounted.

grenade: Do you think there are any other pieces I should add to this?
Attachment #8817019 - Flags: feedback?(rthijssen)
Q: When you have a chance could you attach the ram disk scripts and details to this bug, please?
Flags: needinfo?(q)
Comment on attachment 8817019 [details]
Prep spot function starting at line 899 of Ec2UserdataUtils.psm1

lgtm
minor nit: you could lose the if($false)/else block that's no longer used (replaced with the current contents of the else section).
Attachment #8817019 - Flags: feedback?(rthijssen) → feedback+
Grenade: If you are good with this, we should discuss the date and time to merge it.
Attachment #8820458 - Flags: review?(rthijssen)
Comment on attachment 8820458 [details] [review]
Bug 1307798 - 2008 ephemeral drive support

I'm happy to do the merge, watch, validate/rollback dance whenever you want to pull the trigger.
Attachment #8820458 - Flags: review?(rthijssen) → review+
(In reply to Rob Thijssen (:grenade - GMT) from comment #28)
> Comment on attachment 8820458 [details] [review]
> Bug 1307798 - 2008 ephemeral drive support
> 
> I'm happy to do the merge, watch, validate/rollback dance whenever you want
> to pull the trigger.

Could you merge before the golden process kicks off on 2017-01-03? And keep an eye out on it till I get on later that day? 

If we can do that I will send out an email to give a heads up on it.
Flags: needinfo?(rthijssen)
will do. added to calendar.
Flags: needinfo?(rthijssen)
Attachment #8823298 - Flags: review?(rthijssen)
Windows 7 is proving to be rather problematic.

For an unknown reason, and a high percentage of the time, Windows 7 does not see all of its connected drives. It seems that whatever drive gets labeled as disk one does not appear as a usable drive. The OS sees it as a disk, but cannot mount it to be usable. When attempting to access the drive through the disk manager GUI, it begins to throw errors that the console is not up to date.

To work around this I began to test with a small SSD (1 GB) as disk one and the ephemeral drives as disks 2 and 3. This seemed to work at first, but during one iteration of performance tuning, one of the ephemeral drives was seen as the unusable drive and the 1 GB SSD was mounted to the directory in the userdata.

Thus far there has been no apparent reason for this behavior, or for why one out of 5 newly spawned instances saw the SSD drive as usable and not the ephemeral drive.

Looking at hg clone times, I don't know how much we would be able to gain by mounting an additional drive in a hack-around way.

With a warm drive (an instance that has been running tests):
4.03 MB/sec

Newly spawned instance with a "cold" drive:
3.58 MB/sec

Newly spawned instance with mounted SSD (capped in the image):
3.74 MB/sec
I had the same problem with Windows 7 in TaskCluster worker types. More often than not, it wasn't possible to mount a third volume. I never got it properly resolved. When we switched from ephemeral to EBS volumes, I couldn't get the 3rd volume to mount correctly no matter what (we use 3 on TC. C: OS, Y: Caches, Z: Tasks). In the end I gave up on the 3rd volume and only mounted a second volume and partitioned it into two logical drives (y:, z:), then captured that as an AMI. We still use this fudged AMI for all the Win 7 TC testers.
Resolving as WONTFIX with regard to Windows 7.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Flags: needinfo?(q)