Closed Bug 1305174 Opened 8 years ago Closed 6 years ago

EBS initialization makes I/O absurdly slow on freshly provisioned instances

Categories

(Infrastructure & Operations :: RelOps: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: markco)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Windows])

I was investigating an apparent Mercurial 3.9 performance regression in bug 1304791. :markco gave me access to a Windows Server 2008 instance in use1. Not only did I notice that Mercurial performance was horrible, but I/O was just generally bad. Downloading a 1.5 GB Mercurial bundle from S3, I noticed that data was effectively streaming to memory and being written back out to disk at only 2-3 MB/s. Profiling system calls using Process Monitor confirmed that low-level file write function calls were taking 0.5-1.2s to write 2 MB. Ouch.

arr found an AWS support thread which led her to the following:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ebs-initialize.html

From these docs:

> New EBS volumes receive their maximum performance the moment that they
> are available and do not require initialization (formerly known as
> pre-warming). However, storage blocks on volumes that were restored
> from snapshots must be initialized (pulled down from Amazon S3 and
> written to the volume) before you can access the block. This preliminary
> action takes time and can cause a significant increase in the latency
> of an I/O operation the first time each block is accessed.

So basically, if you create an EBS volume from a snapshot, the first I/O to each block will be slow. The first access to a given block, whether a read or a write, pays the initialization penalty; subsequent accesses run at full speed.

I used a Windows port of dd to trigger reads from all blocks on the C:\ drive of the instance I was given access to. It was painfully slow: 2-3 MB/s. I let it run for a few minutes and then killed it. When I started the process again, it read very quickly up to the point where I had previously killed it. This seemingly confirms that EBS volumes initialized from an AMI count as a "snapshot" and suffer from the slow first block access issue.
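For reference, the AWS initialization docs linked above describe the same technique: read every block once so it gets pulled down from S3, after which access runs at normal speed. A minimal sketch (the Linux line is essentially the form in the AWS docs; the Windows line is a rough equivalent for a dd port, and the device path is illustrative since exact syntax varies between ports):

  # Linux form from the AWS docs: read every block once to initialize it
  dd if=/dev/xvdf of=/dev/null bs=1M
  # rough Windows-port equivalent (device name illustrative)
  dd if=\\.\PhysicalDrive0 of=/dev/null bs=1M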

Further reinforcing this suspicion, markco mounted a fresh EBS volume on this same instance. I formatted it as NTFS and conducted similar tests. Disk writes during wget were at least 4x faster. Mercurial operations were faster. It didn't seem to suffer from the slow first block access problem. And this makes sense: Amazon says new EBS volumes have maximum performance from the beginning.

Since EBS volumes initialized from snapshots/AMIs are absurdly slow, I strongly recommend changing our EC2 instance management strategy to do as many operations as possible on a fresh EBS volume that is separate from the instance/AMI volume.

This means builds and tests should have their workspace on a separate EBS volume.

Ideally, any additional programs we install (like mozilla-build) should also be put on a fresh EBS volume. This does mean we lose the advantage of AMIs and their bake-once, reuse-everywhere approach. However, if the "reuse everywhere" bit means 2 MB/s I/O on first use, I dare say installing things separately on N instances may end up faster overall once you factor in the first block access penalty.
Depending on how Firefox automation provisions EC2 instances, this issue could potentially be costing us $100,000+. I figure if this is slowing instances down by 10% on average (possibly more, since some spot instances aren't alive very long, especially in TaskCluster), that equates to needing to run ~10% more instances to provide equal capacity. Then factor in lower developer productivity from builds and tests taking several minutes longer than they could.

Greg and Chris: can you please assess the exposure of Firefox CI automation to this issue? Basically, I want to know which instance types we provision that, like our Windows 2008 buildbot builders, would exhibit the EBS initialization slowdown. Then we need to know how many new instances we create. If instances live for weeks at a time, the severity of this issue is far less than if impacted instances only live for a few hours.
Flags: needinfo?(garndt)
Flags: needinfo?(catlee)
Per #taskcluster, docker-worker attaches a fresh EBS volume on instance startup and doesn't rely on the EBS system volume much, so it doesn't appear to be significantly impacted by this.

But we don't know the state of Windows TaskCluster. Switching needinfo to :grenade.
Flags: needinfo?(garndt) → needinfo?(rthijssen)
I provisioned a fresh Windows Server 2008 c4.2xlarge instance using an official AMI hosted by Amazon. I can confirm the slow I/O on first block access problem is present in that scenario.

So basically anything on the AMI / root EBS volume will be insanely slow on first access. This has drastic implications. For example, the pagefile is on the root EBS volume, so if we start paging (and many operating systems page aggressively by default), we could get hit by this issue. We really need to treat the root EBS volume as effectively read-only and minimize even read access to it.
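As one illustration of getting traffic off the root volume, the pagefile can be relocated to another drive. This is just a hedged sketch of the kind of change involved, not something we've decided to do; it assumes a D: volume exists and leaves the size system-managed (takes effect after a reboot):

  # stop Windows from auto-managing the pagefile on C:
  $cs = Get-WmiObject Win32_ComputerSystem -EnableAllPrivileges
  $cs.AutomaticManagedPagefile = $false
  $cs.Put()
  # drop the C: pagefile setting and create one on D: instead
  Get-WmiObject Win32_PageFileSetting | Where-Object { $_.Name -like 'C:*' } | ForEach-Object { $_.Delete() }
  Set-WmiInstance -Class Win32_PageFileSetting -Arguments @{ Name = 'D:\pagefile.sys'; InitialSize = 0; MaximumSize = 0 }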
In bug 1304791 comment 7, I believe I also identified that our existing EBS volume IOPS are limiting Mercurial performance. Given that the Firefox build system is I/O heavy on all platforms and that slow I/O has been a recurring theme behind slow jobs, I'm starting to question whether we've configured high enough IOPS. That's arguably for a separate bug. But it is definitely related to the theme of "I/O is slow."
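For context (not a concrete proposal), provisioned IOPS are something you request when the volume is created. A hedged AWS CLI sketch, with size, IOPS and availability zone values made up for illustration:

  # request a 200 GiB io1 volume with 4000 provisioned IOPS (numbers illustrative)
  aws ec2 create-volume --volume-type io1 --iops 4000 --size 200 --availability-zone us-west-2a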
I'll still let grenade chime in, but looking at our win2012 and win7 instances, it seems the only EBS volume is the root device.  I believe the instance uses the provided SSDs for things like tasks and generic-worker.
Over the past several months (even years), I've heard "we measured performance of C4 instances and they were slower than C3... we need to stick to ephemeral storage on C3" several times. Each time I was skeptical. Well, it certainly looks like this EBS initialization issue was the root cause. (Context: C3 instances have a root volume attached to the instance ("ephemeral storage"). C4 instances have a root EBS volume with no option for ephemeral storage.) I'm glad we finally got to the bottom of that.
With a sample size of 1 in usw2, the root EBS volume on a Windows Server 2012 instance appears to be 2-3x faster than Windows Server 2008. The I/O on first access is still slow. But compared to ~2MB/s, it definitely feels snappier. It's fast enough for Mercurial operations to not be absurdly slow. But it is still slow - like crappy 5400 RPM laptop disk from 2005 slow.

I/O on new EBS volumes also appears faster under Windows Server 2012.

This reaffirms my observations from several months back where I measured Windows Server 2012 I/O performance to be significantly better than Windows Server 2008. Not sure what's going on here.
Looking at system calls during a Firefox build, cl.exe (the Visual Studio compiler) writes a bunch of temp files in %TEMP%, which is on c:\ and part of the root EBS volume. That almost certainly results in slower Firefox builds even if building from a mounted/fresh EBS volume.

In fact, if I run `dd` to perform a scanning read of the root volume during a Firefox build, I can make the build's CPU usage drop drastically. Presumably that's because the build is blocked on I/O, waiting for something on the root volume.
TaskCluster Windows workers have the c: drive on an EBS volume but perform builds and tests on the z: drive which is mapped to the free SSD ephemeral volume. That doesn't mean they aren't affected though.

1) The AMI creation process uses the c:\builds\hg-shared convention to house an initial clone of mozilla-central which builds use when they run their initial `hg share c:\builds\hg-shared\mozilla-central ...`. If reading from c: is slow on the first run, we're taking a hit there.

2) Additionally, there is an issue in generic-worker which incorrectly sets the APPDATA, LOCALAPPDATA, TMP, TEMP and USERPROFILE environment variables to folders on the c: drive (they should point at the z: drive, but don't). That causes both builds and tests to make heavy use of c: for temporary files. I described this problem here: https://bugzilla.mozilla.org/show_bug.cgi?id=1303455#c10
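As a rough illustration only (the real fix belongs in generic-worker itself, and the z: paths below are made up), the intended mapping is roughly:

  :: point per-task temp and profile locations at the ephemeral z: drive (paths illustrative)
  set TMP=Z:\tmp
  set TEMP=Z:\tmp
  set APPDATA=Z:\Users\task-user\AppData\Roaming
  set LOCALAPPDATA=Z:\Users\task-user\AppData\Local
  set USERPROFILE=Z:\Users\task-user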

Fixing the issue with the local cache in hg-shared is an interesting problem. Either we stop relying on a local cache and always clone (we'd be using clonebundles from regional s3 buckets, so that's probably faster anyway) or we'd have to create (or prime) the local cache on first boot of the spot instance. I'd love to get some advice on which approach is preferred and I'm happy to implement, either way.

The second issue needs to be patched in generic-worker. I've ni'd pmoore for comment on the feasibility of that.
Flags: needinfo?(rthijssen) → needinfo?(pmoore)
I don't have data on how long lived our Windows instances are at the moment. My hunch is that they generally live most of the day after being started up, and are killed off at night when load drops.

On Linux in buildbot, we're only using instances with local SSDs, so we don't suffer from this issue.

Does this problem affect just the root EBS volumes, or any EBS volume that's created from a snapshot?
Flags: needinfo?(catlee)
(In reply to Rob Thijssen (:grenade - GMT) from comment #9) 
> The second issue needs to be patched in generic-worker. I've ni'd pmoore for
> comment on the feasibility of that.

I'm currently investigating how I can make sure the USERPROFILE is set to a custom location for a newly created local user... We'll track the work on that in the other bug (bug 1303455).

I'm attempting first to adapt the underlying system calls, rather than adjust the environment variables directly, in case there are other references to the C:\Users location outside of the env variables that might not get updated by just fixing the env vars.
Flags: needinfo?(pmoore)
It affects any EBS volume created from a snapshot.
(In reply to Rob Thijssen (:grenade - GMT) from comment #9)
> TaskCluster Windows workers have the c: drive on an EBS volume but perform
> builds and tests on the z: drive which is mapped to the free SSD ephemeral
> volume. That doesn't mean they aren't affected though.

Good. It's worth noting that the ephemeral SSDs go away with C4 instances. We're already using C4 instances for TC Linux. (TC Linux mounts a fresh EBS volume for Docker foo though.)
 
> 1) The AMI creation process uses the c:\builds\hg-shared convention to house
> an initial clone of mozilla-central which builds use when they run their
> initial `hg share c:\builds\hg-shared\mozilla-central ...`. If reading from
> c: is slow on the first run, we're taking a hit there.

The initial "checkout" portion of this share will be relatively slow, as Mercurial will have to read data from the EBS snapshot. Fortunately, it doesn't need to read all the data (just the indexes and the portions of files holding the revision being checked out). But I suspect this will still be pretty slow, possibly slower than doing a stream clone to z:\.
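For comparison, a minimal sketch of priming a share base on the ephemeral drive instead (paths are hypothetical, the share extension is assumed to be enabled, and --uncompressed was the stream-clone flag in the Mercurial versions in use at the time):

  hg clone --uncompressed https://hg.mozilla.org/mozilla-central z:\builds\hg-shared\mozilla-central
  hg share z:\builds\hg-shared\mozilla-central z:\task\src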

> Fixing the issue with the local cache in hg-shared is an interesting
> problem. Either we stop relying on a local cache and always clone (we'd be
> using clonebundles from regional s3 buckets, so that's probably faster
> anyway) or we'd have to create (or prime) the local cache on first boot of
> the spot instance. I'd love to get some advice on which approach is
> preferred and I'm happy to implement, either way.

On Linux TC, we perform a clone+checkout on first task run.

In London, I discussed the bootstrap time versus task time approach with Jonas and others. Essentially, both operations take the same amount of time. So it comes down to whether you want this logic living in the TC "platform" layer or in the task layer. We decided it was best to have it in the task layer because:

* More visibility in task logs (also makes it easier to measure overhead)
* More control (easier to change tasks than change the worker)
* VCS interaction is really a task specific action, not a worker one
Depends on: 1305485
Blocks: 450645
No longer blocks: 450645
https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win32-debug/1475592600/autoland_win7_vm-debug_test-web-platform-tests-7-bm138-tests1-windows-build279.txt.gz is a Windows 7 VM test taking like 30 minutes to perform an `hg update`. That's running on t-w732-spot-283 and the absurdly slow I/O is almost certainly due to this bug.
Depends on: 1307798
Assignee: relops → mcornmesser
Let's initially focus on the builders, since that's where we see so many issues with hg, but we should also split out the working directory for the testers as well. I believe Q mentioned a way we could mount a new drive under C:\; if that proves not to be the case, we'll need to work with releng to change the build locations to a new drive. I suspect that would be very complicated, so keeping the paths the same, if we can, is optimal.
As I mentioned in another bug, we can use an NTFS junction to map a path on c:\ to some other (EBS mounted) drive. These work like symlinks and everything should "just work" (/me knocks on wood).
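A minimal sketch of that approach, assuming the fresh EBS volume is mounted as D: and the paths are illustrative (an existing C:\builds would need to be moved aside first, since mklink requires the link path not to exist):

  :: create the target on the fresh volume, then create the junction at c:\builds
  mkdir D:\builds
  mklink /J C:\builds D:\builds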
Whiteboard: [Windows]
Depends on: 1378381
might be safe to close this bug. tc win implemented fixes and i'm not sure there's enough of a time/volume problem in buildbot to warrant addressing it.
tc win instances have been using fresh ebs mounts since these changes (several months ago now):
https://github.com/mozilla-releng/OpenCloudConfig/commit/bf26a64d67f5e74aa9fe901bbd93c48953cf3c65
https://github.com/mozilla-releng/OpenCloudConfig/commit/28dcf8c9d4370aae5b138382da085f48b7ad8469
I also believe we addressed most of the big offenders here.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED