Closed Bug 1578460 Opened 5 years ago Closed 4 years ago

Evaluate the viability of migrating from packet.net AWS bare metal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: wcosta, Assigned: wcosta)

References

Details

Attachments

(10 files)

Bug 1578460 - disable metal pool temporarily and default to spot 5 years ago Chris Cooper [:coop] (he/him) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: re-enable geck-t/t-linux-metal worker-pool r=tomprince 5 years ago Miles Crabill [:miles] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: Enable metal instances worker-pool r=milescrabill 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: Upgrade baremetal worker pool r=tomprince 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: restore baremetal worker pool r=callek 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: Update baremetal images r=callek 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: Fix baremetal workers capacity r=callek 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: scale volumes better in baremetal machines r=callek 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1578460: Configure baremetal with io1 disks r=Callek 4 years ago Wander Lairson Costa 47 bytes, text/x-phabricator-request		Details \| Review

Wander Lairson Costa

Assignee

Description

•

5 years ago

We have a first try push of Android tests running on AWS bare metal:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8070f42f462b059ca53e098b72f590b302adfbce

Wander Lairson Costa

Assignee

Comment 1

•

5 years ago

:coop could you please find someone that can check this try push? There are some tests failing, but it feels like isn't related to the environment they are running. My only option is :jmaher, but he is on PTO.

Flags: needinfo?(coop)

Chris Cooper [:coop] (he/him)

Comment 2

•

5 years ago

I did a quick comparison of the red/orange results that didn't go green on a re-run:

Red

Android 4.0 API16+ opt :: Atier2 - this is broken everywhere (bug 1540782)

Orange

Android 4.3 API16+ opt :: M-1proc(23) - intermittent (bug 1565119)
Android 4.3 API16+ pgo :: M-1proc(7) - intermittent (bug 1434744)
Android 4.3 API16+ pgo :: M-1proc(23) - intermittent (bug 1565119)
Android 4.3 API16+ debug :: R-1proc(R53) - intermittent (bug 1543639)
Android 4.3 API16+ debug :: M-1proc(20) - intermittent (bug 1576379)
Android 4.3 API16+ debug :: M-1proc(57) - intermittent (bug 1565119)
Android 4.3 API16+ debug :: R-1proctier 2 - intermittent (bug 1578043)
Android 7.0 x86-64 debug :: W(wpt2) - intermittent (bug 1577075) ***

I'm highlighting the last one -- Android 7.0 x86-64 debug :: W(wpt2) -- because despite it matching an intermittent bug, I haven't been able to get it to go green after multiple re-triggers.

The other thing we need to look at is the performance for the green/passing jobs. I'm going to loop in :bc at this point to help with that.

:bc - do you have a way to compare the results from this push (see comment #0) to historical data for each test? I paged through the Similar Jobs for a bunch of tests. They all seem to finish in about the same time, plus-or-minus a few minutes. That's hardly scientific though.

:bc - also let us know if there is a better way to structure this so you can help us more easily.

Flags: needinfo?(coop) → needinfo?(bob)

Miles Crabill [:miles]

Updated

•

5 years ago

Comment 4

•

5 years ago

Looking now. Sorry for the delay.

Geoff Brown [:gbrown]

Comment 5

•

5 years ago

Please note that the Android 4.3 tests are obsolete -- we no longer run those tests on trunk. (Also, those have always run on aws, not packet.net!)

Chris Cooper [:coop] (he/him)

Comment 6

•

5 years ago

(In reply to Geoff Brown [:gbrown] from comment #5)

Please note that the Android 4.3 tests are obsolete -- we no longer run those tests on trunk. (Also, those have always run on aws, not packet.net!)

It looks like Wander pushed with all the Android jobs using a fuzzy query, rather than triggering just the packet.net-specific tests. Sorry about that.

Bob Clary [:bc] (inactive)

Comment 7

•

5 years ago

I had waited too long and the logs expired. wcosta did another push to try for me and I added the tests again to have a total of 5 jobs for each test. I also did a try push from autoland e859a5aebb5b with 5 jobs each for the tests.

aws-metal try push.

packetnet push to try from autoland

The test results are comparable. 5 is a pretty low number for comparison but no new failures appeared and it appears that aws-metal may be somewhat better.

Working on timing now. First impression is aws-metal did take much longer but I'll have concrete numbers soon.

Bob Clary [:bc] (inactive)

Comment 8

•

5 years ago

Remember this is with 5 runs of each test:
https://docs.google.com/spreadsheets/d/1u2I4cf97aGTouAimeC2v-pL-QJPKJaOBgZcvY-FlNx4/edit#gid=926559078

aws metal takes 14 more hours elapsed time to run than packet.net

Flags: needinfo?(bob)

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

5 years ago

Blocks: 1574478

Chris Cooper [:coop] (he/him)

Comment 9

•

5 years ago

Discussed this with :miles and :wcosta yesterday. Here's our current state:

Miles was going to revisit the monopacker images we were using before. It's likely they'll need some tweaks to work in the new Firefox CI cluster.
To use those images, we need to recreate the AWS bare metal pool in the new Firefox CI cluster. We should file a bug with releng to get that going.
Once we have the instances running, we'll want to re-validate that the tests still work. Bob's mach try fuzzy logic from comment #7 can help with this.
Once we know things are still working, we'll want to up the concurrency of workers per instance. We're running with 4 workers/instance in packet.net, so that's the first target.
Try scaling up the concurrency as far as we can get away with. Our packet.net instances only have 4 cores, whereas something like c5.metal in AWS has 96 vCPUs. While we probable can't get away with a concurrency of 96, greater than 4 seems likely.
Based on our final concurrency numbers, we should evaluate whether the migration makes sense cost-wise. In fact, we can calculate our cost efficiency threshold per AWS instance type at any point. We know what we pay right now per instance in packet.net and we can use the AWS calculator to ballpark what our break-even point needs to be in terms of concurrency. Note that this doesn't need to be a perfect mapping; I'm willing to pay (slightly) more in AWS if it means eliminating the management headache of packet.net.

:miles, :wcosta - can I get you to turn that list into dependent bugs and action items, please?

Flags: needinfo?(wcosta)

Flags: needinfo?(miles)

Miles Crabill [:miles]

Comment 10

•

5 years ago

I've confirmed that our recent images are working on metal instances. I baked some fresh images here:

us-east-1: ami-0ccb2b4c4a0694e01
us-west-1: ami-0601039bc3bee6675
us-west-2: ami-00885e2f54435445e

I'll create blocker bugs to this one.

Flags: needinfo?(wcosta)

Flags: needinfo?(miles)

Miles Crabill [:miles]

Updated

•

5 years ago

Depends on: 1596948

Wander Lairson Costa

Assignee

Comment 11

•

5 years ago

Last week I tried another approach. I instanciated a bare metal machine by hand and ran the same scripts I used to provision packet. That would make identical instances configuration. Unfortunately, the kernel in bare metal doesn't ship with snd-aloop and V4L2, necessary to run docker-worker.

Chris Cooper [:coop] (he/him)

Comment 12

•

5 years ago

Attached file Bug 1578460 - disable metal pool temporarily and default to spot — Details

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → coop

Chris Cooper [:coop] (he/him)

Updated

•

5 years ago

Assignee: coop → nobody

Miles Crabill [:miles]

Comment 13

•

5 years ago

Attached file Bug 1578460: re-enable geck-t/t-linux-metal worker-pool r=tomprince — Details

Phabricator Automation

Updated

•

5 years ago

Assignee: nobody → miles

Chris Cooper [:coop] (he/him)

Comment 14

•

4 years ago

Here's the recent push that Wander did:

https://treeherder.mozilla.org/#/jobs?repo=try&searchStr=geckoview%2Candroid%2C7.0&revision=af5b54c22e150b1dcde4137e06b0b7ecbb0f56eb&selectedJob=279000057

He has helpfully collated the perf results here:

https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=0

Note that's a comparison between two single runs, but illustrates that overall perf is not that different (~5%).

Wander Lairson Costa

Assignee

Comment 15

•

4 years ago

Attached file Bug 1578460: Enable metal instances worker-pool r=milescrabill — Details

Chris Cooper [:coop] (he/him)

Updated

•

4 years ago

Assignee: miles → wcosta

Status: NEW → ASSIGNED

Wander Lairson Costa

Assignee

Comment 16

•

4 years ago

Attached file Bug 1578460: Upgrade baremetal worker pool r=tomprince — Details

We use m5.metal and c5.metal instance types, since their availaibiliy in
spot market is higher than the 5d counter parts.

m5 has more RAM compared to c5, so it can afford more parallel tasks.

Wander Lairson Costa

Assignee

Comment 17

•

4 years ago

Attached file Bug 1578460: restore baremetal worker pool r=callek — Details

This uses monopacker AWS images without an official CoT key.

Wander Lairson Costa

Assignee

Comment 18

•

4 years ago

Attached file Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo — Details

GeckoView tasks require privileged containers.

Geoff Brown [:gbrown]

Updated

•

4 years ago

Comment 19

•

4 years ago

(In reply to Wander Lairson Costa from comment #18)

Created attachment 9126693 [details]
Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo

GeckoView tasks require privileged containers.

I would very strongly prefer not having any workers with privileged enabled in the firefox-ci cluster

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 20

•

4 years ago

(In reply to Tom Prince [:tomprince] from comment #19)

(In reply to Wander Lairson Costa from comment #18)

Created attachment 9126693 [details]
Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo

GeckoView tasks require privileged containers.

I would very strongly prefer not having any workers with privileged enabled in the firefox-ci cluster

privileged tasks are a requirement for GeckoView to run the emulator correctly, that's how they have been running in packet since the beginning. The alternative is to move them to generic-worker.

Flags: needinfo?(wcosta)

Chris Cooper [:coop] (he/him)

Comment 21

•

4 years ago

(In reply to Wander Lairson Costa from comment #20)

privileged tasks are a requirement for GeckoView to run the emulator correctly, that's how they have been running in packet since the beginning. The alternative is to move them to generic-worker.

As Wander indicates, we are already running privileged tasks in docker for GeckoView in packet.net. I'm going to argue that moving all workloads from packet.net back into AWS is a net security win because it's one less cloud provider to secure.

We can certainly look at improving the security story of these tasks as a second step. Running them using task user separation under generic-worker is an obvious choice, but we could also experiment with podman, rkt, etc.

Geoff Brown [:gbrown]

Comment 22

•

4 years ago

Are we still blocked on privileged mode?

I think I introduced use of privileged at packet.net specifically to access the kvm device from the docker container -- necessary for hardware acceleration of the android x86 emulator. kvm is essential for the emulator used for all the geckoview tests. However, :snorp points out that, in modern versions of docker, the --device argument should be able to provide kvm access without --privileged. A way forward?

Chris Cooper [:coop] (he/him)

Comment 23

•

4 years ago

(In reply to Geoff Brown [:gbrown] from comment #22)

Are we still blocked on privileged mode?

We shouldn't be, per comment #21.

Wander: what are we blocked on? Do you have an update?

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 24

•

4 years ago

(In reply to Geoff Brown [:gbrown] from comment #22)

Are we still blocked on privileged mode?

I think I introduced use of privileged at packet.net specifically to access the kvm device from the docker container -- necessary for hardware acceleration of the android x86 emulator. kvm is essential for the emulator used for all the geckoview tests. However, :snorp points out that, in modern versions of docker, the --device argument should be able to provide kvm access without --privileged. A way forward?

We recently added support for /dev/kvm in docker-worker. I will investigate the feasibly of removing the privileged requirement.

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 25

•

4 years ago

Attached file Bug 1578460: Update baremetal images r=callek — Details

These images adds a bunch of fixes and improvements for baremetal
machines.

Wander Lairson Costa

Assignee

Comment 26

•

4 years ago

Attached file Bug 1578460: Fix baremetal workers capacity r=callek — Details

This updates the worker pool to allow workers to run on full capacity.

Wander Lairson Costa

Assignee

Comment 27

•

4 years ago

Attached file Bug 1578460: scale volumes better in baremetal machines r=callek — Details

baremetal machines run up to 36 tasks in parallel, and measurements show
it imposes a bottleneck on disk writes. To mitigate this, instead of
configuration one single disk, we create the volume with several smaller
disks.

Wander Lairson Costa

Assignee

Updated

•

4 years ago

Depends on: 1624642

Wander Lairson Costa

Assignee

Updated

•

4 years ago

Depends on: 1624649

Wander Lairson Costa

Assignee

Comment 28

•

4 years ago

Attached file Bug 1578460: Configure baremetal with io1 disks r=Callek — Details

We [cmr]5.metal instance types with io1 because gp2 doesn't scale with
multiple parallel tasks. We also add [cmr]5d.metal instances which
already ship with SSD disks and don't need custom disk setup.

We decrease the number of tasks to 24 per instance to make sure we don't
have I/O bottleneck. In the future we should investigate what the
optimal number of tasks for each instance.

Joel Maher ( :jmaher ) (UTC -8)

Comment 29

•

4 years ago

as this bug describes evaluate, what other criteria do we need to confirm before discussing moving from packet to aws?

Chris Cooper [:coop] (he/him)

Comment 30

•

4 years ago

•

Edited

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #29)

as this bug describes evaluate, what other criteria do we need to confirm before discussing moving from packet to aws?

We're currently working to find metal instance variants that are:

a) generally available on spot in regions we support;
b) have adequate performance relative to packet.net; and,
c) cost around the same amount per hour per worker as in packet.net

Until recently, a) was the primary concern, but we've now by using multiple metal instance types, we've moved past that hurdle.

Recall from https://bugzilla.mozilla.org/show_bug.cgi?id=1599144#c2 that in packet.net we're paying $0.103/hour/worker. Due to the nature of the instances there, that's a static cost that makes it easy to compare.

Using the new AWS calculator, I priced out the current config with the io1-optimized storage. Using the current config, I come up with a cost of $0.279/hour/worker for the m5.metal type in us-east-1. ($4,880.94 instance/month / 730 hours/month / 24 workers/instance)

So the current config is a non-starter from a cost perspective. I've asked Wander to use the calculator to create a short list of configs that would work from a cost standpoint. Decreasing the # of workers per instance is one way we could achieve this and might preclude the need for special iops configs.

e.g. If we used a m5.metal instance with a slightly larger gp2 disk (1000GB), we would only need to run 8 workers/instance to bring the cost down to $0.098/hour/worker. ($570.94 instance/month / 730 hours/month / 8 workers/instance) Does performance suffer with a gp2 disk and 8 workers? We'd need to test to find out.

Once Wander has a list of viable configs from a cost perspective, he can quickly iterate through them on Try and find the best one.

Wander Lairson Costa

Assignee

Comment 31

•

4 years ago

•

Edited

This is my first try on reducing costs:

28 tasks, 1000GB io1, 28000 Iops
Cost ~ 0.177 (per worker per hour)

Try push: https://treeherder-taskcluster-staging.herokuapp.com/#/jobs?repo=try&revision=10ea99e97f36c334f5054b79e015aa23247e9e08&selectedJob=5166

When this finishes I will run another configuration.

Wander Lairson Costa

Assignee

Comment 32

•

4 years ago

•

Edited

8 tasks, 400GB gp2
Cost ~ 0.295
Try push: https://treeherder-taskcluster-staging.herokuapp.com/#/jobs?repo=try&revision=95ba99ce00ea35b7d7597a025d72aedc2e1e0000&selectedJob=5333

Wander Lairson Costa

Assignee

Comment 33

•

4 years ago

As the spot prices fluctuate, I edited the costs with a more conservative estimate of 50% discount in the spot market.

Wander Lairson Costa

Assignee

Comment 34

•

4 years ago

My last try push was with 28 workers, 1000GB io1, 16000 Iops
Costs ~ 0.13

Try push: https://treeherder-taskcluster-staging.herokuapp.com/#/jobs?repo=try&revision=43fad293337de6a0f1b85a7ca7d6424dfa8eaac5

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

tl;dr: we are significantly worse in terms of performance compared with [cmr]5d.metal instances we tried in december

Chris Cooper [:coop] (he/him)

Comment 35

•

4 years ago

•

Edited

(In reply to Wander Lairson Costa from comment #34)

tl;dr: we are significantly worse in terms of performance compared with [cmr]5d.metal instances we tried in december

Remind me again why we can't use those instance types? That doesn't seem to be documented in the bug anywhere.

Is it due to lack of of availability? I guess that's comment #16.

Wander Lairson Costa

Assignee

Comment 36

•

4 years ago

•

Edited

(In reply to Chris Cooper [:coop] pronoun: he from comment #35)

(In reply to Wander Lairson Costa from comment #34)

tl;dr: we are significantly worse in terms of performance compared with [cmr]5d.metal instances we tried in december

Remind me again why we can't use those instance types? That doesn't seem to be documented in the bug anywhere.

Is it due to lack of of availability? I guess that's comment #16.

Exactly. But bear in mind that the comparison isn't totally fair. The machines in packet are running for a long time, so they have docker image, hg repo, tooltool downloads all cached, while AWS instances have to download them.

Chris Cooper [:coop] (he/him)

Comment 37

•

4 years ago

(In reply to Wander Lairson Costa from comment #36)

Exactly. But bear in mind that the comparison isn't totally fair. The machines in packet are running for a long time, so they have docker image, hg repo, tooltool downloads all cached, while AWS instances have to download them.

That's also why I'm a little flexible on cost, though. The AWS instances will get shutdown when not in use so we won't be paying for them 24/7 like we are for packet.net.

Chris Cooper [:coop] (he/him)

Comment 38

•

4 years ago

(In reply to Wander Lairson Costa from comment #34)

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

You've run a bunch of different configs now. Which config was this updated performance data for? We want to to track the performance of all the configs so we can know which configs are promising and which are non-starters. This will avoid re-doing work in a few weeks because we didn't write it down.

I would also suggest trying again with the [cmr]5d.metal instances in all the regions we support.

Wander Lairson Costa

Assignee

Comment 39

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #38)

(In reply to Wander Lairson Costa from comment #34)

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

You've run a bunch of different configs now. Which config was this updated performance data for? We want to to track the performance of all the configs so we can know which configs are promising and which are non-starters. This will avoid re-doing work in a few weeks because we didn't write it down.

I would also suggest trying again with the [cmr]5d.metal instances in all the regions we support.

It was for the latest config, which had the best prices. I am starting to look in how to automate these tests, it is getting too time-consuming doing it manually.

Wander Lairson Costa

Assignee

Comment 40

•

4 years ago

I created a new spreadsheet with more diverse and precise data on workers performance and costs

The performance data are measured in minutes and the costs are based on the average spot market prices from the last 30 days.

Wander Lairson Costa

Assignee

Updated

•

4 years ago

Depends on: 1631049

Wander Lairson Costa

Assignee

Comment 41

•

4 years ago

Last Tuesday we had the idea of using part of the RAM available as an in-memory disk and turns out it worked (with patches to monopacker and ci-admin). You can check the results in the performance spreadsheet. In summary, we could get r5.metal running 32 parallel tasks with an hourly cost of ~ $0.07 and m5.metal running 15 tasks with an hourly cost of ~ $0.14. I am going to close this bug and open a new one to track migration progress.

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

Resolution: --- → FIXED

Wander Lairson Costa

Assignee

Updated

•

4 years ago