Closed Bug 1417024 Opened 8 years ago Closed 8 years ago

Adding deepspeech-gpu instances

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gerard-majax, Unassigned)

Details

To perform testing with GPU-enabled instances, we would need:

  "instanceTypes": [
    {
      "instanceType": "p2.xlarge",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {
        "isolatedContainers": true
      },
      "launchSpec": {
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvdb",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 120,
              "VolumeType": "gp2"
            }
          }
        ]
      }
    }
  ],
I've created aws-provisioner/deepspeech-gpu as a copy of aws-provisioner/deepspeech-worker, but with the updated instanceTypes definition from comment 0 and an increased maxPrice of 0.6 (deepspeech-worker had 0.2).
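As a side note, before pasting a fragment like the one in comment 0 into a worker-type definition, it can be worth sanity-checking it mechanically. The following is a minimal, illustrative sketch (not part of aws-provisioner); the field names are taken verbatim from the JSON above, but the checks themselves are just examples of what one might verify:

```python
import json

# The instanceTypes fragment from comment 0 (field names verbatim).
WORKER_TYPE_FRAGMENT = """
{
  "instanceTypes": [
    {
      "instanceType": "p2.xlarge",
      "capacity": 1,
      "utility": 1,
      "secrets": {},
      "scopes": [],
      "userData": {"isolatedContainers": true},
      "launchSpec": {
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvdb",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 120,
              "VolumeType": "gp2"
            }
          }
        ]
      }
    }
  ]
}
"""

def check_instance_types(fragment: str) -> list[str]:
    """Return a list of problems found in an instanceTypes fragment."""
    problems = []
    config = json.loads(fragment)
    for it in config.get("instanceTypes", []):
        if not it.get("instanceType"):
            problems.append("missing instanceType")
        if it.get("capacity", 0) < 1:
            problems.append("capacity must be >= 1")
        for mapping in it.get("launchSpec", {}).get("BlockDeviceMappings", []):
            # Spot instances should not leave orphaned EBS volumes behind.
            if not mapping.get("Ebs", {}).get("DeleteOnTermination"):
                problems.append("EBS volume not deleted on termination")
    return problems

print(check_instance_types(WORKER_TYPE_FRAGMENT))  # [] means no problems found
```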
Note, you'll need to set up one or more new clients like https://tools.taskcluster.net/auth/clients/project%2Fdeepspeech%2Fworker%2Fmacos-light-6 but for the new workers that will claim these tasks.
Flags: needinfo?(lissyx+mozillians)
I'm talking nonsense, these will be spawned by AWS - I was thinking these were workers you were managing yourself!
Flags: needinfo?(lissyx+mozillians)
I think this is done. Can you confirm that you are able to schedule tasks OK, and that you get the p2.xlarge instance types like you should?
Flags: needinfo?(lissyx+mozillians)
Thanks, I'll take a look at that as soon as I can :)
I have a task that is starting:
> [taskcluster 2017-11-27 12:05:03.100Z] Worker Node Type: p2.xlarge
> [taskcluster 2017-11-27 12:05:03.100Z] Worker Type: deepspeech-gpu
Flags: needinfo?(lissyx+mozillians)
So far, trying to run |nvidia-smi| in the container fails:
> + nvidia-smi
> modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.13.0-112-generic/modules.dep.bin'
> modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.13.0-112-generic/modules.dep.bin'
> modprobe: ERROR: ../libkmod/libkmod-module.c:809 kmod_module_insert_module() could not find module by name='nvidia_340'
> modprobe: ERROR: could not insert 'nvidia_340': Function not implemented
> NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

:wcosta is checking that the host EC2 instance is properly configured to run Docker with NVIDIA.
Flags: needinfo?(wcosta)
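The failure mode above is worth spelling out: the NVIDIA kernel module must already be loaded on the host, because modprobe inside a container generally cannot load host kernel modules. A hedged sketch of a pre-flight check a GPU task could run before invoking nvidia-smi; the function name is illustrative and not part of docker-worker:

```python
def nvidia_driver_loaded(proc_modules_text: str) -> bool:
    """Check /proc/modules content for a loaded nvidia kernel module.

    In a real task you would pass open('/proc/modules').read(); here the
    content is injected so the check is testable anywhere.
    """
    for line in proc_modules_text.splitlines():
        # The first whitespace-separated field of each line is the module name.
        name = line.split(" ", 1)[0]
        if name == "nvidia" or name.startswith("nvidia_"):
            return True
    return False

# Illustrative module lists: a host with the driver loaded vs. one without.
WITH_DRIVER = "nvidia 12312576 98 nvidia_uvm,nvidia_modeset, Live\nxt_conntrack 16384 1 - Live"
WITHOUT_DRIVER = "xt_conntrack 16384 1 - Live\noverlay 147456 2 - Live"

print(nvidia_driver_loaded(WITH_DRIVER))     # True
print(nvidia_driver_loaded(WITHOUT_DRIVER))  # False
```

Failing fast with a clear message ("driver not loaded on host") would be more actionable than the modprobe errors in the log above.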
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> So far, trying to run |nvidia-smi| in the container fails:
>
> > + nvidia-smi
> > modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.13.0-112-generic/modules.dep.bin'
> > modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.13.0-112-generic/modules.dep.bin'
> > modprobe: ERROR: ../libkmod/libkmod-module.c:809 kmod_module_insert_module() could not find module by name='nvidia_340'
> > modprobe: ERROR: could not insert 'nvidia_340': Function not implemented
> > NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This error goes away once you install the linux-image-3.13.0-112-generic package. However, another error then comes up, related to grub-probe.

> :wcosta is checking that the host EC2 is properly configured to run Docker
> with NVIDIA.

I confirmed that the EC2 image the deepspeech-gpu worker type runs on is just another generic instance. No special NVIDIA package was installed, afaik.
Flags: needinfo?(wcosta)
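The reason installing linux-image-3.13.0-112-generic helps: modprobe looks for module metadata under /lib/modules/$(uname -r), and the moddep files it complained about in comment #9 only appear once the package matching the running kernel release is installed. A small illustrative sketch of that check (the helper name is hypothetical; the release string and file names mirror the log):

```python
from pathlib import Path
import tempfile

def moddep_present(modules_root: str, release: str) -> bool:
    """modprobe needs /lib/modules/<release>/modules.dep(.bin); check for it."""
    d = Path(modules_root) / release
    return (d / "modules.dep.bin").exists() or (d / "modules.dep").exists()

# Demo with a temporary directory standing in for /lib/modules.
with tempfile.TemporaryDirectory() as root:
    release = "3.13.0-112-generic"
    print(moddep_present(root, release))   # False: matching kernel package missing
    tree = Path(root) / release
    tree.mkdir()
    (tree / "modules.dep").write_text("")  # what installing the package provides
    print(moddep_present(root, release))   # True
```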
Some more info: getting an EC2 image set up with the CUDA driver won't be an easy task, IIUC. docker-worker EC2 instances run on Ubuntu 14.04, and nvidia-docker requires 16.04. I heard migrating docker-worker to run on 16.04 is... tough.
(In reply to Wander Lairson Costa [:wcosta] from comment #11)
> Some more info: getting an EC2 image set up with the CUDA driver won't be an
> easy task, IIUC. docker-worker EC2 instances run on Ubuntu 14.04, and
> nvidia-docker requires 16.04. I heard migrating docker-worker to run on
> 16.04 is... tough.

Hm, maybe we can get lucky: according to this issue, some people have it running on 14.04: https://github.com/NVIDIA/nvidia-docker/issues/475
If there's no way to quickly hack that on top of the current 14.04, do you know when docker-worker is expected to work on 16.04?
Flags: needinfo?(wcosta)
(In reply to Alexandre LISSY :gerard-majax from comment #13)
> If there's no way to quickly hack that on top of the current 14.04, do you
> know when docker-worker is expected to work on 16.04?

I am experimenting with upgrading the AWS images to 16.04 right now.
Flags: needinfo?(wcosta)
Looks like I failed to follow up on that bug, but long story short: Wander spent a lot of time trying to get that working, without success. So I guess we should just remove the worker type for now, and revisit once the bits are ready.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(pmoore)
Resolution: --- → WONTFIX
I've deleted deepspeech-gpu worker type.
Flags: needinfo?(pmoore)
Component: Service Request → Operations and Service Requests