Closed Bug 1529304 Opened 6 years ago Closed 2 years ago

Provide a worker with GPU (CUDA) support

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marco, Assigned: ahal)

References

Details

(Keywords: leave-open)

Attachments

(14 files, 1 obsolete file)


It'd be great if it were possible to run tasks for ML workloads using CUDA.

Amazon provides pre-built AMIs, e.g. https://aws.amazon.com/marketplace/pp/B077GCH38C, but judging from the nvidia-docker documentation, it looks like this would also be possible without a custom AMI: https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2.
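For what it's worth, a quick way to confirm that the GPUs are actually exposed inside a task container would be something along these lines. This is a hypothetical sanity-check script, not something that exists in the tree; it only assumes that nvidia-smi is reachable from inside the container once nvidia-docker is set up:

  #!/usr/bin/env python3
  """Hypothetical GPU sanity check for a CUDA-enabled worker."""
  import subprocess
  import sys

  def main() -> int:
      try:
          # Ask the NVIDIA driver which GPUs are visible in this environment.
          result = subprocess.run(
              ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
              capture_output=True, text=True, check=True,
          )
      except (FileNotFoundError, subprocess.CalledProcessError) as exc:
          print(f"No usable GPU visible here: {exc}", file=sys.stderr)
          return 1
      print("Visible GPUs:")
      print(result.stdout.strip())
      return 0

  if __name__ == "__main__":
      sys.exit(main())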

:coop, could someone from your team investigate whether this is feasible, and how easily?

Flags: needinfo?(coop)

I think Alexandre is working on this very thing -- perhaps he can help?

Flags: needinfo?(lissyx+mozillians)

I can't help, but I'm interested. In my case, this is needed to run under Linux, and the last time we worked on that with :wcosta, the issue was not being able to get NVIDIA-Docker working on the AMI to expose the device in the docker-worker container. That was maybe a year and a half ago?

Flags: needinfo?(lissyx+mozillians)

The Taskcluster team is extremely constrained for resources right now. Part of that is due to a migration of some AWS workloads to GCE. We can commit a small amount of time to reviews if someone else wants to take a stab at this, either in AWS or GCE (https://cloud.google.com/compute/docs/gpus/add-gpus).

We can help, but likely not until Q3.

Flags: needinfo?(coop)

(In reply to Chris Cooper [:coop] pronoun: he from comment #4)

> The Taskcluster team is extremely constrained for resources right now. Part of that is due to a migration of some AWS workloads to GCE. We can commit a small amount of time to reviews if someone else wants to take a stab at this, either in AWS or GCE (https://cloud.google.com/compute/docs/gpus/add-gpus).
>
> We can help, but likely not until Q3.

If we wanted to help, do you have an idea of what this would entail? How long would it take for someone outside of the Taskcluster team?

Needinfo for comment 5. We have a new contractor starting soon that might help with this, depending on the amount of work that would be required.

Flags: needinfo?(coop)

(In reply to Marco Castelluccio [:marco] from comment #6)

> Needinfo for comment 5. We have a new contractor starting soon that might help with this, depending on the amount of work that would be required.

Redirecting the NI to :wcosta, but my quick take is that it will take someone playing with the docker image to get something that is a) working, and b) stable enough that it won't be a support nightmare. Per comment #3, maybe NVIDIA-Docker is in a more stable place after 18 months?

Flags: needinfo?(coop) → needinfo?(wcosta)

As far as I recall from the state back then, the problem was just that NVIDIA-Docker required a more recent version of Docker, and it was unclear what the fallout / regressions for docker-worker might be. Plus there was the need to produce a specific AMI to integrate NVIDIA-Docker.

(In reply to Alexandre LISSY :gerard-majax from comment #8)

> As far as I recall from the state back then, the problem was just that NVIDIA-Docker required a more recent version of Docker, and it was unclear what the fallout / regressions for docker-worker might be. Plus there was the need to produce a specific AMI to integrate NVIDIA-Docker.

I am currently working on upgrading the EC2 image to Linux 18.04 and Docker 18.09.3. Hopefully this will help get nvidia-docker running.

Flags: needinfo?(wcosta)

We're making some improvements to the docker-worker deployment process, including being able to update the docker version. This may be possible soon.

Component: General → Workers

Is it something we could have now?

Flags: needinfo?(wcosta)

(In reply to Alexandre LISSY :gerard-majax from comment #11)

> Is it something we could have now?

It is possible to do it now in the monorepo. I can do it after the baremetal migration.

Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa from comment #12)

> (In reply to Alexandre LISSY :gerard-majax from comment #11)
> > Is it something we could have now?
>
> It is possible to do it now in the monorepo. I can do it after the baremetal migration.

Cool, do you have any idea of the ETA for that? I have no idea how much work this requires.

Flags: needinfo?(wcosta)

(In reply to Alexandre LISSY :gerard-majax from comment #13)

> (In reply to Wander Lairson Costa from comment #12)
> > (In reply to Alexandre LISSY :gerard-majax from comment #11)
> > > Is it something we could have now?
> >
> > It is possible to do it now in the monorepo. I can do it after the baremetal migration.
>
> Cool, do you have any idea of the ETA for that? I have no idea how much work this requires.

The ETA is the end of the second quarter.

Flags: needinfo?(wcosta)

We want to start supporting GPU-enabled pools in GCP for machine learning. This will allow us to attach GPUs from the same zone as the instance type.
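For reference, attaching a GPU on the GCP side boils down to a guestAccelerators entry in the instance config. Below is a minimal sketch with placeholder values; the real pool definitions live in ci-configuration and go through worker-manager's own schema, so treat the field names (which follow the GCP Compute API) as the only grounded part:

  # Illustrative sketch only: the zone, machine type, and GPU count are made up.
  launch_config = {
      "machineType": "zones/us-central1-a/machineTypes/n1-standard-16",
      "guestAccelerators": [
          {
              # GPUs have to come from the same zone as the instance itself.
              "acceleratorType": "zones/us-central1-a/acceleratorTypes/nvidia-tesla-v100",
              "acceleratorCount": 4,
          }
      ],
      # GCP requires GPU instances to terminate rather than live-migrate
      # during host maintenance.
      "scheduling": {"onHostMaintenance": "TERMINATE"},
  }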

Assignee: nobody → ahal
Status: NEW → ASSIGNED
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/e536e2a60c8c Support translating 'guestAccelerators.acceleratorType' in ci-admin, r=releng-reviewers,gabriel
Attachment #9325513 - Attachment description: WIP: Bug 1529304: add translations image and workerType → WIP: Bug 1529304: add translations image, provider, and workerType
Attachment #9322786 - Attachment is obsolete: true
Attachment #9325513 - Attachment description: WIP: Bug 1529304: add translations image, provider, and workerType → Bug 1529304: add translations image, provider, and workerType
Attachment #9325513 - Attachment description: Bug 1529304: add translations image, provider, and workerType → WIP: Bug 1529304: add translations image, provider, and workerType
Attachment #9325513 - Attachment description: WIP: Bug 1529304: add translations image, provider, and workerType → Bug 1529304: add translations image, provider, and workerType
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/cf3796a301f0 add translations image, provider, and workerType r=releng-reviewers,bhearsum
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/ace1c740c517 add translations image, provider, and workerType. round 2 r=bhearsum,releng-reviewers
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/085da960ddde translations gpu workers: relax cpu platform restriction r=bhearsum,releng-reviewers
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/9d5290f431b3 reduce instance size to work with v100 gpu r=bhearsum,releng-reviewers
Attachment #9330732 - Attachment description: Bug 1529304: add g-w linux implementation, use for translations gpu workers r=bhearsum → Bug 1529304: translations gpu workers tweaks r=bhearsum
Attachment #9330732 - Attachment description: Bug 1529304: translations gpu workers tweaks r=bhearsum → Bug 1529304: translations gpu worker tweaks r=bhearsum
Attachment #9330732 - Attachment description: Bug 1529304: translations gpu worker tweaks r=bhearsum → Bug 1529304: add g-w linux implementation, use for translations gpu workers r=bhearsum
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/ee2e4aa3687f add g-w linux implementation, use for translations gpu workers r=bhearsum,releng-reviewers
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/e5364c707d28 use newer translations image with fixes, r=bhearsum

Fixes missing zstandard (zstd != zstandard).

See https://github.com/taskcluster/monopacker/pull/106 for details about the changes to the image.
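If the missing piece was the Python zstandard module (which "zstd != zstandard" suggests), the distinction is that the zstd CLI and the unrelated zstd package on PyPI do not provide it. A trivial, purely illustrative check:

  # Illustration of "zstd != zstandard": having the zstd CLI (or the `zstd`
  # PyPI package) does not make the `zstandard` Python module importable.
  import importlib.util
  import shutil

  print("zstd CLI on PATH:      ", shutil.which("zstd") is not None)
  print("'zstandard' importable:", importlib.util.find_spec("zstandard") is not None)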

Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/77ba0dc95bca use newer translation image round 2, r=bhearsum,releng-reviewers
Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/93dfe3a710dd update translations image r3, r=bhearsum,releng-reviewers

A handful of changes here:

  • Drop the unused b-linux-gcp and t-linux-gcp worker pools.
  • Rename GPU workers to b-linux style; drop experimental comment
  • Drop the seemingly unused scratch disks from GPU workers
  • Add permanent 300gb variant for CPU & GPU workers
  • Add permanent 1tb variant for GPU workers
  • Add some temporary pools to experiment with other GPU types: p100, a100, l4

The extra disk space variants are required for certain steps that consume and/or output huge amounts of data in full training runs.

Eventually we'll settle on 4 GPU worker pools:

  • A fairly low-powered one for the steps that aren't terribly GPU intensive (this will probably be the existing v100 x 1 - but we'll see)
  • 75gb, 300gb, and 1tb variants of a higher-powered one that gives us the best bang for our buck, determined by the results from the temporary pools above.

For now, I'm adding 300gb/1tb versions of the v100 x 4 pools to unblock production training while experiments happen.
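As a quick aid for scanning, the variant matrix described above boils down to roughly the following. The names and structure here are illustrative only; the real definitions live in ci-configuration with their own naming and schema:

  # GPU types and disk sizes are taken from the comment above; everything
  # else is made up for readability.
  permanent_pools = {"v100 x 4": ["75gb", "300gb", "1tb"]}  # production training
  experimental_gpus = ["p100", "a100", "l4"]                # temporary comparison pools

  for gpu, disks in permanent_pools.items():
      for disk in disks:
          print(f"permanent:    {gpu} ({disk})")
  for gpu in experimental_gpus:
      print(f"experimental: {gpu}")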

Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/f00cffd1db70 Overhaul Firefox Translations workers r=aerickson
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/8996b09fb238 use translations images in the taskcluster-imaging project, to work around permissions issues with the translations-sandbox project. r=aerickson
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/b5cd715b922c use a very short idleTimeout for GPU workers r=aerickson

For lack of a better place to put it, here's a quick write-up on some experimentation I did with different instance/GPU types. I tried running the same training with the following configurations:

  • v100 x 1
  • p100 x 4
  • a100 x 1
  • l4 x 1

The a100 was by far the most expensive option, and also seems to have the most limited availability.

The l4 was slightly cheaper than v100s and p100s, but also took nearly double the amount of time to train.

The v100s and p100s were similar in price, with the v100s coming out slightly lower.

With all of this in mind, I'm going to stick with v100s for now, and do some additional experiments to see the relative efficiency of 4, 8, and 16 GPUs attached to the same instance.

The first pass showed that v100s are the best bet. This round will test training with 4, 8, and 16 of these GPUs.

(This diff doesn't read terribly well - this patch simply removes the a100/p100/l4 instances and adds 8 & 16 GPU v100 instances.)

Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/46f7dbec4258 update translations worker pools for second round of GPU experiements r=releng-reviewers,gbrown

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #40)

> For lack of a better place to put it, here's a quick write-up on some experimentation I did with different instance/GPU types. I tried running the same training with the following configurations:
>
>   • v100 x 1
>   • p100 x 4
>   • a100 x 1
>   • l4 x 1
>
> The a100 was by far the most expensive option, and also seems to have the most limited availability.
>
> The l4 was slightly cheaper than v100s and p100s, but also took nearly double the amount of time to train.
>
> The v100s and p100s were similar in price, with the v100s coming out slightly lower.
>
> With all of this in mind, I'm going to stick with v100s for now, and do some additional experiments to see the relative efficiency of 4, 8, and 16 GPUs attached to the same instance.

It turns out that the max GPUs for v100s is 8. If we're not satisfied with this, the A100s are the next thing to try. These are both faster per GPU, and we have the possibility of using 16 GPUs/instance.

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #43)

> It turns out that the max GPUs for v100s is 8. If we're not satisfied with this, the A100s are the next thing to try. These are both faster per GPU, and we have the possibility of using 16 GPUs/instance.

I did some comparisons with 4 & 8 v100s. Somewhat surprisingly, the 4 GPU instances completed their training tasks nearly as fast as the 8 GPU instances. I did this test four times with the exact same inputs, with these results:

  • 4 x v100: 23m, 28m, 23m, 26m
  • 8 x v100: 22m, 20m, 21m, 23m

The best 8 GPU run is obviously much faster than the worst 4 GPU run - but on average it was only ~15% faster. Given this, I'm going to stick with 4 GPU instances for the moment, as it's difficult to justify doubling our GPU cost for a relatively minor gain.

I would bet that there are things we can tweak in the training config or pipeline that will make better use of the available GPUs (or perhaps updating to the latest marian version?) - so we should revisit this in the future.
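To put rough numbers on that, here is a quick back-of-the-envelope check of the run times listed above, using nothing fancier than arithmetic means:

  # Run times in minutes, copied from the comparison above.
  from statistics import mean

  runs_4x_v100 = [23, 28, 23, 26]
  runs_8x_v100 = [22, 20, 21, 23]

  avg4, avg8 = mean(runs_4x_v100), mean(runs_8x_v100)
  print(f"4 x v100 average: {avg4:.1f} min")          # 25.0 min
  print(f"8 x v100 average: {avg8:.1f} min")          # 21.5 min
  print(f"average speedup:  {avg4 / avg8 - 1:.0%}")   # ~15-16% faster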

Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/b9b18418387d remove 8 and 16 GPU instances that we will not be using r=releng-reviewers,gbrown

We're now at a point where we can run the entire training pipeline in Taskcluster, and have settled on our initial instance types. I think this bug has outlived its usefulness - we can file specific follow-ups if we need future changes to the VM image or workers.

Thanks to ahal for adding GPU support to ci-config, and aerickson for all the work on the new VM image.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
