Closed Bug 1207023 Opened 9 years ago Closed 9 years ago

delivery: set best instance type for upload.ffxbld.productdelivery.prod.mozaws.net

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: oremj)

References

Details

This is just for the ffxbld upload host in production; everything else is much lower traffic. First, let's bring over a comment from bug 1186297, where rail said:

I investigated multiple options to figure out what would be the optimal instance type for the upload host.

One of the ideas was to simulate load comparable to what we have now.

The first step was to figure out the current upload rates. I tried to use graphite, but the data is too coarse - tx/rx rates are not what we need.

I took this approach to figure out the upload rates we experience:

* find all files modified in a particular period of time. I applied this to Firefox, Fennec and B2G files living on stage.m.o, modified within 24h on a busy day with multiple releases in flight (around Sep 15)
* Generate time series and analyze the rates. In our case the max is most important because we have to plan for peaks.

The results are below:

30s max: 874 Mbps
1m max: 658 Mbps
3m max: 477 Mbps
5m max: 439 Mbps
10m max: 378 Mbps
1h max: 320 Mbps
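
For illustration, a rough sketch of how rates like these could be derived from file mtimes (a hedged reconstruction, not the exact commands used; the stage paths and the 60-second bucketing are assumptions):

  # sum bytes landed per 1-minute bucket over the last 24h, report the peak
  find /pub/mozilla.org/firefox /pub/mozilla.org/mobile /pub/mozilla.org/b2g \
      -type f -mmin -1440 -printf '%T@ %s\n' |
    awk '{ bytes[int($1/60)] += $2 }
         END { for (b in bytes) if (bytes[b] > max) max = bytes[b]
               printf "1m max: %.0f Mbps\n", max * 8 / 60 / 1e6 }'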

Load simulation is a tricky task which may take a lot of resources. We thought that we could use taskcluster to spin up a lot of clients and upload some files. This would require some extra work to prep proper images with all needed secrets baked in and to write custom scripts to generate traffic.

From our past experience with proxxy, we will need quite a beefy instance (assuming we can't use multiple instances in parallel) to meet the needed network performance. Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html m3.2xlarge might be what we need.
----------

Rail, during the S3 migration meeting today we wondered
* are you selecting based on IOps, or network bandwidth ?
* the newer m4 instances have enhanced networking, but EBS instead of SSD; it could be worth checking them

I also (wondered aloud) about a ramdisk instead of physical disk, because this is all ephemeral data. I'm thinking now this isn't very helpful, because matching the 40G we have on stage:/tmp is m4.4xlarge or r3.2xlarge, and that's before any change in data-at-rest timing (it'll take longer to sequentially upload from scl3, before running post_upload).
Flags: needinfo?(rail)
(In reply to Nick Thomas [:nthomas] from comment #0)
> Rail, during the S3 migration meeting today we wondered
> * are you selecting based on IOps, or network bandwidth ?

Network bandwidth. IOps can be adjusted if needed (you can use IOps-optimized EBS).
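
A hedged example of what adjusting that later could look like - the size, IOPS count and availability zone below are placeholders, not anything we have provisioned:

  # create a provisioned-IOPS (io1) EBS volume to attach if disk becomes the bottleneck
  aws ec2 create-volume --availability-zone us-west-2a \
      --size 200 --volume-type io1 --iops 2000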

> * the newer m4 instances have enhanced networking, but EBS instead of SSD;
> it could be worth checking them

I believe you can choose SSD (it's still EBS though). I don't have any strong opinion about m3 vs m4. It's mostly about prices. In the worst-case scenario we can switch from m3 to m4. It will require an instance shutdown, but no migration is needed.

> I also (wondered aloud) about a ramdisk instead of physical disk, because
> this is all ephemeral data. I'm thinking now this isn't very helpful,
> because matching the 40G we have on stage:/tmp is m4.4xlarge or
> r3.2xlarge, and that's before any change in data-at-rest timing (it'll take
> longer to sequentially upload from scl3, before running post_upload).

This is actually a great idea. If you have enough RAM we can definitely use tmpfs as $TMP.
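
A minimal sketch of what that could look like (the mount point, the size and the use of TMPDIR are assumptions about the wiring, not an existing config):

  mkdir -p /mnt/upload-tmp
  # needs matching RAM headroom on the instance
  mount -t tmpfs -o size=40g tmpfs /mnt/upload-tmp
  export TMPDIR=/mnt/upload-tmp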
Flags: needinfo?(rail)
The price difference between m3.2xlarge and m4.2xlarge is minimal ($0.532/hr vs $0.502/hr in us-west-2), and the former has local SSD disk. I can't check what we have on upload.ffxbld.productdelivery.prod.mozaws.net right now, but on the stage equivalent it's a t2.medium with what must be an EBS disk. The m3's have varying amounts of SSD, and we may want to downgrade the instance later on as we stop uploading this way, so it may be better to go with an m4 for the optimized EBS. Does that sound right ?

re tmpfs, probably too late at this point to re-engineer the upload host. Looks like we'd end up spending more on the instance too, if we want to match the 40G of space in stage:/tmp. Typically we are using a lot less than that, maybe just a few GB at any time, but if something goes wrong the space gets used up quickly.
Our AMI mounts ephemeral storage to /media/ephemeral0/. Where in /tmp are you uploading now and can it be changed?
Flags: needinfo?(nthomas)
Most things are doing a 'mktemp -d', via https://dxr.mozilla.org/mozilla-central/source/build/upload.py#208. There are a few places where that is reimplemented that we'd need to track down too. In theory we could modify that to be 'mktemp -d -p <somepath>'.
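
A minimal sketch of that change (the /media/ephemeral0/tmp directory is hypothetical, and the real fix would live in upload.py rather than being run by hand):

  mkdir -p /media/ephemeral0/tmp
  # point the temp dir at ephemeral storage instead of the default /tmp
  mktemp -d -p /media/ephemeral0/tmp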
Flags: needinfo?(nthomas)
Does the question imply that local SSD would be much preferable to EBS/optimised-EBS ?
Blocks: 1213790
I'm going to update ffxbld hosts to m4.xlarge, which has high performance networking. Let's see if this improves speed enough.
Prod and stage are now m4.xlarge. Let's reopen if we need to change again.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED