Closed
Bug 1307798
Opened 8 years ago
Closed 8 years ago
Windows instances in AWS should not be doing work on c:\
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gps, Assigned: markco)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [Windows])
Attachments
(3 files)
Currently, tests executing on Windows AWS testers are working out of c:\. Unfortunately, c:\ on AWS is horrendously slow due to bug 1305174. I/O on c:\ is likely horrible. This almost certainly slows down tests, which makes end-to-end times slower and wastes money because we need more instances to do the same amount of work.
Establishing a fresh EBS volume (or using ephemeral volumes) and having tests run from there should make things faster.
catlee: please triage this request
Flags: needinfo?(catlee)
Comment 1•8 years ago
This has a very high impact in our end to end times.
This is especially noticeable when doing artifact builds on try, where cloning and updating take longer than the build process.
We should also make similar changes for test machines for when we look into running them from the source checkout.
This is already the case for web-platform-tests on Linux/TC.
Comment 2•8 years ago
I believe that Mark is working on the builders at the moment in bug 1305174.
Assignee: nobody → relops
Component: General Automation → RelOps
Flags: needinfo?(catlee)
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → arich
Updated•8 years ago (Assignee)
Assignee: relops → mcornmesser
Comment 3•8 years ago (Assignee)
What I am trying to do is mount a drive directly to the directory or directories in which the work is being done.
Do we have a list of specific directories in which work is being done? "C:\builds\moz2_slave" and "C:\builds\hg-shared"?
Alternatively, does anyone know if we can move the keys and tokens in C:\builds to another location? If that is a possibility we can map the drive directly to C:\builds. If not, the drive will need to be mapped to individual directories, because the drive mapping must be done to empty directories.
Comment 4•8 years ago
NI catlee with the hope that he can point you at someone in releng who can answer your questions.
Flags: needinfo?(catlee)
Comment 5•8 years ago (Reporter)
A hacky option is to create an NTFS junction that maps e.g. c:\builds to an EBS mounted drive. In theory this will slow down access to that path by having to resolve the junction. But that should be much faster than using an EBS mounted drive.
Comment 6•8 years ago
I think Joel is probably much more familiar with which paths are used by tests. I think most (all?) of the path references are stored in mozharness configs now.
Can we move keys and tokens that are currently in C:\builds to the EBS volume? Or to another temporary location on C: and then copy them into C:\builds once we've mounted the EBS volume?
Flags: needinfo?(catlee)
Comment 7•8 years ago (Assignee)
Update: I now have a function to add to Ec2UserdataUtils.psm1 that will be called by the userscript. The function will mount the ephemeral drives to c:\builds for 2008 and C:\slave\test\build for Windows 7.
On 2008, with the ephemeral drives mounted and performing an hg clone, I am seeing a transfer rate between 7.5 and 8 MB/sec. I have not tested on Windows 7 as of yet.
At this point I need to add logic to move the contents of those directories out pre-mount and back in post-mount, and to figure out the other cloud tools pieces.
Updated•8 years ago (Assignee)
Whiteboard: [Windows]
Updated•8 years ago (Assignee)
Summary: Windows testers in AWS should not be doing work on c:\ → Windows instances in AWS should not be doing work on c:\
Comment 8•8 years ago (Assignee)
Update:
I have a userdata change that will map c:\builds for 2008 and c:\slave\test\build for Windows 7 to the ephemeral drives, instance share 1 and 2, for c3.2xlarge instances.
For Windows 7 there is a bit of an issue with the existing ram disk: the instance sees it as disk 1, which conflicts with one of the ephemeral drives. This is a particular issue on g2 instances since there is only one ephemeral drive.
For 2008 it is generally functioning. The only possible issue is that on shutdown, not reboot, the C:\builds directory becomes unreachable. This might only be an issue for loaned machines, since during the life of a production instance it is rebooted and not shut down. So we may just need to have loaned instances use a different type to avoid this.
Comment 9•8 years ago (Assignee)
Grenade: this is what I have thus far for the prep spot function:
function Prep-Spot {
  param (
    [switch] $force,
    [string] $tempdir = ('{0}\builds' -f $env:ProgramData)
  )
  begin {
    Write-Log -message ("{0} :: Function started" -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
  }
  process {
    # The ephemeral disks are assumed to appear as D: when they are present.
    $empdisk = "D:"
    $diskexist = Test-Path $empdisk
    If ($diskexist -eq $False) {
      Write-Log -message ("{0} :: Ephemeral disks are not available " -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
    #if ($false) {
      #Write-Log -message ("{0} :: detected prior run. skipping spot setup" -f $($MyInvocation.MyCommand.Name)) -severity 'DEBUG'
    } else {
      if ($env:ComputerName.Contains('w732')) {
        Write-Host "Windows 7 detected. Mounting ephemeral drives to C:\slave\test\build"
        Write-Log -message "Mounting ephemeral drives to C:\slave\test\build"
        $mountPath = ('{0}\slave\test\build' -f $env:SystemDrive)
        # The mount point must be empty, so park its contents in $tempdir first.
        Get-ChildItem -Path $mountPath | Move-Item -destination $tempdir
        if ((Test-Path -Path $mountPath -PathType Container -ErrorAction SilentlyContinue) -and ((Get-ChildItem $mountPath | Measure-Object).Count -eq 0)) {
          Mount-EphemeralDisks
          # -Recurse so that subdirectories are restored with their contents.
          Get-ChildItem -Path $tempdir | Copy-Item -destination $mountPath -Recurse
        }
      }
      Write-Host $env:ComputerName
      if ($env:ComputerName.Contains('2008')) {
        Write-Host "Windows 2008 detected. Mounting ephemeral drives to C:\builds"
        Write-Log -message "Mounting ephemeral drives to C:\builds"
        $mountPath = ('{0}\builds' -f $env:SystemDrive)
        # The mount point must be empty, so park its contents in $tempdir first.
        Get-ChildItem -Path $mountPath | Move-Item -destination $tempdir
        if ((Test-Path -Path $mountPath -PathType Container -ErrorAction SilentlyContinue) -and ((Get-ChildItem $mountPath | Measure-Object).Count -eq 0)) {
          Mount-EphemeralDisks
          # -Recurse so that subdirectories are restored with their contents.
          Get-ChildItem -Path $tempdir | Copy-Item -destination $mountPath -Recurse
        }
      }
    }
  }
}
With the:
  $empdisk = "D:"
  $diskexist = Test-Path $empdisk
  If ($diskexist -eq $False) {
I am trying to check for the ephemeral disks before moving further down the function. Do you have any suggestions to get around checking directly for the D: drive? Also, do you have any other suggestions?
Flags: needinfo?(rthijssen)
Comment 10•8 years ago
on win 7 with powershell < v3, options are limited. i had similar problems before updating to newer ps. newer ps has cmdlets for volume management like Get-Volume, which doesn't need the volume to be mounted in order to work.
you can use things like: `Get-WmiObject -Class win32_diskdrive` which will work and is only slightly messier.
Flags: needinfo?(rthijssen)
Comment 11•8 years ago (Assignee)
Catlee: I am going to need some direction from releng on this because any changes here will need to consider price and instance availability. Below are times from various instance types with various drives mounted, each performing a clone of https://hg.mozilla.org/mozilla-central/. Where noted I tested without a paging file.
The c3.2xlarge and c3.4xlarge with ephemeral disks show a significant performance improvement, and a greater improvement without the paging file. My concerns are instance availability for c3.2xlarge and instance price for c3.4xlarge. I am also not sure whether the lack of a paging file would cause an issue elsewhere in the build.
For the c4.2xlarge the only significant improvement was with the general purpose SSD mounted and no paging file. Other c4.2xlarge configurations without the paging file experienced a severe hit to performance.
Overall I am not sure how we would want to move forward on this.
Type c4.2xlarge
No additional disks
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)
finished applying clone bundle
Type c4.2xlarge
General Purpose SSD
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 1301.9 seconds (1.44 MB/sec)
finished applying clone bundle
Type c4.2xlarge
General Purpose SSD With no paging file
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 676.0 seconds (2.77 MB/sec)
finished applying clone bundle
Type c4.2xlarge
Provisioned IOPS SSD
transferred 1.83 GB in 1676.4 seconds (1.12 MB/sec)
finished applying clone bundle
searching for changes
c3.2xlarge
Instance share (Ephemeral)
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 491.2 seconds (3.81 MB/sec)
finished applying clone bundle
c3.2xlarge
Instance share (Ephemeral) with no paging file
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 317.4 seconds (5.90 MB/sec)
finished applying clone bundle
c3.4xlarge
Instance share (Ephemeral)
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 616.3 seconds (3.04 MB/sec)
finished applying clone bundle
c3.4xlarge
Instance share (Ephemeral) with no paging file
269488 files to transfer, 1.83 GB of data
transferred 1.83 GB in 342.9 seconds (5.46 MB/sec)
finished applying clone bundle
Each clone was performed on a fresh instance based off of spot-b-2008-2016-11-15-10-44 (ami-673b1370).
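For anyone comparing these runs, a minimal Python sketch of how the hg summary lines above can be turned into a speedup figure. The function names `parse_clone_summary` and `speedup` are illustrative and not part of any tooling mentioned in this bug; the two sample lines are copied from the results above.

```python
import re

# Matches hg's clone summary line, e.g.
# "transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)"
SUMMARY = re.compile(
    r"transferred ([\d.]+) GB in ([\d.]+) seconds \(([\d.]+) MB/sec\)"
)


def parse_clone_summary(line):
    """Return (gigabytes, seconds, reported MB/sec), or None if no match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


def speedup(baseline_line, candidate_line):
    """How many times faster the candidate clone finished than the baseline."""
    _, base_seconds, _ = parse_clone_summary(baseline_line)
    _, cand_seconds, _ = parse_clone_summary(candidate_line)
    return base_seconds / cand_seconds


# c4.2xlarge with no additional disks (the instance type in use at the time)
baseline = "transferred 1.83 GB in 1472.0 seconds (1.27 MB/sec)"
# c3.2xlarge with ephemeral disk and no paging file
candidate = "transferred 1.83 GB in 317.4 seconds (5.90 MB/sec)"

print(round(speedup(baseline, candidate), 2))
```

Comparing elapsed seconds rather than the reported MB/sec keeps the comparison valid only when both runs transfer the same amount of data, as they do here (1.83 GB each).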
Flags: needinfo?(catlee)
Comment 12•8 years ago
:markco: I don't see any numbers with c4.4xlarge (which is what taskcluster has moved to). Did you test there?
Flags: needinfo?(mcornmesser)
Comment 13•8 years ago (Assignee)
I will test on c4.4xlarge tomorrow. I performed one run, which I did not record, without an additional drive mounted and the performance was lackluster. I will do some further testing with mounted drives and without the paging file.
We may not be able to get the same performance as taskcluster on the buildbot builders because of the old 32 bit version of python that is being used.
Flags: needinfo?(mcornmesser)
Comment 14•8 years ago (Reporter)
Something is still wrong with I/O if it is only doing 1-6MB/s. https://public-artifacts.taskcluster.net/BoCy9jVqQAChL4qBXj-s1w/0/public/logs/live_backing.log is a TC job on a c4.4xlarge doing ~2x that:
ensuring https://hg.mozilla.org/integration/autoland/@9f88b41193c6bf9730f75170ecb4cdb932450738 is available at .\build\src
(cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
(sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
applying clone bundle from https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-unified/7f782f9a462dca67d483c5454658d4f0c89a2a51.packed1-gd.hg
270073 files to transfer, 1.54 GB of data
transferred 1.54 GB in 138.4 seconds (11.4 MB/sec)
Comment 15•8 years ago (Assignee)
Testing on c4.4xlarge with a mounted ssd and no paging file:
270048 files to transfer, 1.83 GB of data
transferred 1.83 GB in 506.5 seconds (3.70 MB/sec)
finished applying clone bundle
Better than current but still not close to taskcluster. I will need to go back through what has been done for the configuration for taskcluster to try to figure out the reason for the difference.
Comment 16•8 years ago
mark, one thing to check is what process priority the hg process is running under (use process explorer/monitor). if it is running as normal, above normal or high, ignore this suggestion. if it is running as below normal, you can set the following registry keys to assign it a higher priority (above normal cpu, normal io):
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\hg.exe\PerfOptions
Name: CpuPriorityClass
Value: 0x00000006 (dword)
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\hg.exe\PerfOptions
Name: IoPriority
Value: 0x00000002 (dword)
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\python.exe\PerfOptions
Name: CpuPriorityClass
Value: 0x00000006 (dword)
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\python.exe\PerfOptions
Name: IoPriority
Value: 0x00000002 (dword)
the other thing that made a positive difference on taskcluster was using a separate EBS drive for the build (not C: and not ephemeral, just a large (80gb) bog standard EBS SSD). you will notice when you select the second drive in the EC2 console and play with the size of it, the IOPS get larger as the drive gets larger.
on tc win builds:
- the second ebs drive gave us a 30% performance boost on all worker types (win 7, 10, 2012).
- the hg priority registry hack gave a 600% performance boost on win 7 and a barely noticeable boost everywhere else.
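The four registry values listed earlier in this comment could be captured as a .reg file so they can be applied in one step. The following Python sketch only generates the file text; the key path and the two value names and numbers come from the comment, while the function name `perf_options_reg` and the idea of shipping a .reg file are assumptions for illustration, not how the TaskCluster images are actually configured.

```python
# Key path and values are taken from the comment above; the rest is illustrative.
BASE = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion"
    r"\Image File Execution Options"
)
PERF_OPTIONS = {
    "CpuPriorityClass": 0x6,  # above normal cpu, per the comment
    "IoPriority": 0x2,        # normal io, per the comment
}


def perf_options_reg(executables):
    """Build .reg file text that sets PerfOptions for each executable name."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for exe in executables:
        lines.append("[{}\\{}\\PerfOptions]".format(BASE, exe))
        for name, value in PERF_OPTIONS.items():
            # DWORD values are written as eight hex digits in .reg files.
            lines.append('"{}"=dword:{:08x}'.format(name, value))
        lines.append("")
    return "\r\n".join(lines)


reg_text = perf_options_reg(["hg.exe", "python.exe"])
print(reg_text)
```

Generating the text from one table keeps the hg.exe and python.exe entries identical, which is what the comment describes.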
Comment 17•8 years ago (Assignee)
Update for Win 7: In order to use the ephemeral disks the ram disk will need to be removed and remounted. There is an odd situation where, at some level, the instance is labeling the ram disk as disk 1. When the ephemeral drive is mounted the disk manager also sees it as disk 1, so there is a bit of a conflict there.
Comment 18•8 years ago (Assignee)
For 2008 with the configuration mentioned in Comment 16, the clone times are better, but are still significantly slower than Task Cluster. The transfer speed was between 3.0 and 4.0 MB/sec.
Grenade: For TC Win 7 what kind of transfer rate have you seen?
Flags: needinfo?(rthijssen)
Comment 19•8 years ago
we don't have any reliable data for win7 because we don't have any tasks that perform hg checkouts on that os. here's one i ran just now:
- win 7:
https://g2mtvxyaaaavrla3wavuibgyowmjngyfo42ecqzxsxp4s7nl.taskcluster-worker.net:60023/log/CBaKTjMqQO-BbC_zq3hD_w
270409 files to transfer, 1.54 GB of data
transferred 1.54 GB in 622.8 seconds (2.54 MB/sec)
i was curious about the other os'es so tested them too:
- win 10:
https://g3l3omaaaaavrlbj2gll7xvvshqcwx4spienv3na72zd4cy6.taskcluster-worker.net:60023/log/FYtKr9VRRnK81l4jnk8P4g
270409 files to transfer, 1.54 GB of data
transferred 1.54 GB in 142.1 seconds (11.1 MB/sec)
- win 2012:
https://g2kvn4aaaaavrlblxuycv3hracegh56g4bcf667giknwl636.taskcluster-worker.net:60023/log/WCWjZ_wGR9KEKL9x65nSyw
270409 files to transfer, 1.54 GB of data
transferred 1.54 GB in 138.0 seconds (11.5 MB/sec)
win 7 is slow. before we gave the hg process 'above normal' priority, it was even slower (the same checkout would time out after 3 hours. i never ran one to completion)
Flags: needinfo?(rthijssen)
Comment 20•8 years ago (Assignee)
The registry settings from comment 16 have very little if any impact on 2008.
Thus far the best results I have are:
c3.2xlarge with ephemeral drive mounted and no paging file with 6+ MB/sec
c3.4xlarge with ephemeral drive mounted and no paging file with +/- 6 MB/sec
c4.4xlarge with paging file set to 46078 (recommended size by the OS) with +/- 4 MB/sec
These are better than the current performance. The difference between this and TC could possibly be due to the difference in the age of the OSes. I am going to spend a bit more time trying to improve the performance.
However, I will still need direction from releng on which instance type(s) to use moving forward.
Comment 21•8 years ago (Assignee)
grenade: Is there a location you can point me to so I can take a look at the mercurial.ini file you are using on Task Cluster instances?
Flags: needinfo?(rthijssen)
Comment 22•8 years ago
yes, it's here: https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Configuration/Mercurial/mercurial.ini
Flags: needinfo?(rthijssen)
Comment 23•8 years ago (Assignee)
<@catlee> markco: what do you need to know in https://bugzilla.mozilla.org/show_bug.cgi?id=1307798
<markco> Catlee: Which instance type do we want to use for the 2008 instances? c3.2xlarge or c3.4xlarge with ephemeral drives mounted, or c4.4xlarge with an ssd drive mounted? Much better performance with one of the first two.
<@catlee> what are we using right now?
<markco> c4.2xlarge
<@catlee> c3.4xlarge sounds good
<markco> catlee: rgr I will move forward with that
Flags: needinfo?(catlee)
Comment 24•8 years ago (Assignee)
This is what I have for the function.
I am not going to be able to put a disk check into the function because of the lack of functionality of powershell on 2008 and because the ephemeral drives on the c3.4xlarge are not formatted when they are mounted.
grenade: Do you think there are any other pieces I should add to this?
Attachment #8817019 - Flags: feedback?(rthijssen)
Comment 25•8 years ago (Assignee)
Q: When you have a chance could you attach the ram disk scripts and details to this bug, please?
Flags: needinfo?(q)
Comment 26•8 years ago
Comment on attachment 8817019 [details]
Prep spot function starting at line 899 of Ec2UserdataUtils.psm1
lgtm
minor nit: you could lose the if($false)/else block that's no longer used (replaced with the current contents of the else section).
Attachment #8817019 - Flags: feedback?(rthijssen) → feedback+
Comment 27•8 years ago (Assignee)
Grenade: If you are good with this, we should discuss the date and time to merge it.
Attachment #8820458 - Flags: review?(rthijssen)
Comment 28•8 years ago
Comment on attachment 8820458 [details] [review]
Bug 1307798 - 2008 ephemeral drive support
I'm happy to do the merge, watch, validate/rollback dance whenever you want to pull the trigger.
Attachment #8820458 - Flags: review?(rthijssen) → review+
Comment 29•8 years ago (Assignee)
(In reply to Rob Thijssen (:grenade - GMT) from comment #28)
> Comment on attachment 8820458 [details] [review]
> Bug 1307798 - 2008 ephemeral drive support
>
> I'm happy to do the merge, watch, validate/rollback dance whenever you want
> to pull the trigger.
Could you merge before the golden process kicks off on 2017-01-03, and keep an eye on it until I get on later that day?
If we can do that I will send out an email to give a heads up on it.
Flags: needinfo?(rthijssen)
Comment 31•8 years ago
Comment 32•8 years ago (Assignee)
Attachment #8823298 - Flags: review?(rthijssen)
Comment 33•8 years ago
Comment on attachment 8823298 [details] [review]
Update watch pending and y-2008 instance configs
merged: https://github.com/mozilla-releng/build-cloud-tools/commit/99fc551cded99c54251ffcaeb57f3ea87b8b27f6
Attachment #8823298 - Flags: review?(rthijssen) → review+
Comment 34•8 years ago (Assignee)
Windows 7 is proving to be rather problematic.
For an unknown reason, and a high percentage of the time, Windows 7 does not see all of its connected drives. It seems that whatever drive gets labeled as disk 1 does not appear as a usable drive. The OS sees it as a disk, but cannot mount it to be usable. When attempting to access the drive through the disk manager GUI it begins to throw errors that the console is not up to date.
To work around this I began to test with a small SSD (1 Gig) as disk 1 and the ephemeral drives as disks 2 and 3. This seemed to work at first, but while I was working on performance tuning, during one iteration one of the ephemeral drives was seen as the unusable drive and the 1 Gig SSD was mounted to the directory in the userdata.
Thus far there has been no apparent reason for this behavior, or for why one out of 5 newly spawned instances saw the SSD drive as usable and not the ephemeral drive.
Looking at hg clone times, I don't know how much we would be able to gain by mounting an additional drive in a hack-around way.
With a warm drive (an instance that has been running tests)
4.03 MB/sec
Newly spawned instance with a "cold" drive:
3.58 MB/sec
Newly spawned instance with mounted SSD (capped in the image)
3.74 MB/sec
Comment 35•8 years ago
I had the same problem with Windows 7 in TaskCluster worker types. More often than not, it wasn't possible to mount a third volume. I never got it properly resolved. When we switched from ephemeral to EBS volumes, I couldn't get the 3rd volume to mount correctly no matter what (we use 3 on TC. C: OS, Y: Caches, Z: Tasks). In the end I gave up on the 3rd volume and only mounted a second volume and partitioned it into two logical drives (y:, z:), then captured that as an AMI. We still use this fudged AMI for all the Win 7 TC testers.
Comment 36•8 years ago (Assignee)
Resolving as WONTFIX with regard to Windows 7.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX