i3.metal worker for deepspeech taskcluster
Categories
(Infrastructure & Operations :: RelOps: Posix OS, task)
Tracking
(Not tracked)
People
(Reporter: dhouse, Assigned: dhouse)
References
Details
DeepSpeech tests on m5d.2xlarge instances are unreliable. Without KVM, the QEMU-based Android emulator is slow (in some cases, apparently slow enough that tests cannot complete).
In #taskcluster, :gerard-majax requested that we try i3.metal to see whether KVM is available there.
With the transition of worker management, RelOps was asked to do the setup, but we are not able to access the taskcluster AWS account to create a new AMI for the worker type.
I created a new workerType through the taskcluster tools UI, re-using the existing deepspeech-worker AMIs (created by pmoore?) but on the i3.metal instance type. This partly works: the container can see that the CPUs have virtualization enabled, but /dev/kvm is not exposed to the container. Docker-worker may need to be changed to allow mounting a device, or running privileged, for KVM to be usable by the contained QEMU (Android emulator). The workerType is set up with 0-5 spot instances and a max price of $4.50; the spot price has been stable around $1.50 over the last 3 months.
Bug 1520824, filed with taskcluster, tracks gaining access to /dev/kvm from the container.
The i3.metal instances have 36 physical cores (72 vCPUs), so we could run more than one worker at a time on them.
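If we do pack multiple workers onto one host, the aws-provisioner workerType definition has a per-instance-type capacity field for exactly this. A hypothetical fragment (values illustrative, not the actual deepspeech-kvm-worker config):

```json
{
  "minCapacity": 0,
  "maxCapacity": 5,
  "instanceTypes": [
    {
      "instanceType": "i3.metal",
      "capacity": 2,
      "utility": 2,
      "launchSpec": {},
      "secrets": {},
      "userData": {},
      "scopes": []
    }
  ]
}
```

Here "capacity": 2 would tell the provisioner each i3.metal counts as two workers toward maxCapacity.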
https://tools.taskcluster.net/aws-provisioner/deepspeech-kvm-worker
Example task failing to find KVM: https://taskcluster-artifacts.net/fr-SDmzCQrSmlTPb-8hGIQ/0/public/logs/live_backing.log
emulator: CPU Acceleration: DISABLED
emulator: CPU Acceleration status: /dev/kvm is not found: VT disabled in BIOS or KVM kernel module not loaded
emulator: ERROR: x86_64 emulation currently requires hardware acceleration!
Please ensure KVM is properly installed and usable.
CPU acceleration status: /dev/kvm is not found: VT disabled in BIOS or KVM kernel module not loaded
* daemon started successfully
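The emulator message lumps two different causes together ("VT disabled in BIOS or KVM kernel module not loaded"). A small sketch of a helper that separates them (my own diagnostic, not anything from the task; it takes the cpuinfo and device paths as parameters so it can be exercised against fake inputs):

```shell
#!/bin/sh
# kvm_diagnosis CPUINFO_FILE KVM_DEVICE_PATH
# Distinguishes the failure modes the emulator reports as one message.
kvm_diagnosis() {
  cpuinfo="$1"; kvmdev="$2"
  if ! grep -q -E 'vmx|svm' "$cpuinfo"; then
    # No VT-x/AMD-V flags visible: disabled in BIOS, or not passed through.
    echo "no-virt-flags"
  elif [ ! -e "$kvmdev" ]; then
    # Flags are present but the device node is missing: kvm module not
    # loaded on the host, or the container was not given the device.
    echo "no-kvm-device"
  else
    echo "kvm-available"
  fi
}
```

On the i3.metal workerType we are in the middle case: the flags are visible inside the container but /dev/kvm is not.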
Test of accessing KVM from inside a container (showing valid KVM access when privileged or with /dev/kvm mounted, but with some warnings because I tested on an older CPU):
[david@george docker]$ docker run --rm -ti kvmtest
Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize KVM: No such file or directory
[david@george docker]$ docker run --privileged --rm -ti kvmtest
QEMU 2.11.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
info kvm
kvm support: enabled
(qemu) qemu-system-x86_64: terminating on signal 15 from pid 8 (timeout)
[david@george docker]$ docker run --device /dev/kvm:/dev/kvm --rm -ti kvmtest
QEMU 2.11.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
info kvm
kvm support: enabled
(qemu) qemu-system-x86_64: terminating on signal 15 from pid 8 (timeout)
[david@george docker]$ cat Dockerfile
FROM ubuntu:18.04
RUN apt-get -qq update \
&& apt-get -qq -y install qemu-kvm
CMD echo "info kvm" | timeout 5.0s qemu-system-x86_64 --monitor stdio --enable-kvm -vnc localhost:55000
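For context, docker-worker already exposes some host devices through task payload capabilities (e.g. its loopback audio/video devices), and bug 1520824 asks for the same treatment for /dev/kvm. A hypothetical task payload fragment, assuming a kvm device capability is added (the naming is my guess, not an implemented API):

```yaml
payload:
  image: kvmtest
  command:
    - /bin/sh
    - -c
    - ls -l /dev/kvm
  capabilities:
    devices:
      kvm: true   # hypothetical: would map /dev/kvm into the container
```

As with the existing device capabilities, a matching docker-worker scope on the task would presumably be required.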
A subset of the DeepSpeech tests is completing as of this morning, and it was confirmed over IRC that this is acceptable for now while taskcluster works on bug 1520824 to provide KVM to the containers.
Based on the activity in bug 1520824, KVM access is still requested but not yet implemented in the docker-worker containers.
Nested virtualization was enabled on GCP, as explained in https://cloud.google.com/compute/docs/instances/enable-nested-virtualization-vm-instances
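Per that GCP doc, nested virtualization is enabled by creating an image that carries a special license; instances booted from it then expose VT-x to guests. Roughly (disk/zone names are illustrative):

```shell
# From the GCP nested-virtualization doc: build an image with the
# enable-vmx license attached. Source disk and zone are placeholders.
gcloud compute images create nested-virt-image \
  --source-disk example-disk \
  --source-disk-zone us-central1-a \
  --licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
```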
From https://bugzilla.mozilla.org/show_bug.cgi?id=1520824
the change to monopacker for docker-worker was merged Feb 12th: https://github.com/taskcluster/monopacker/pull/42
Because https://bugzilla.mozilla.org/show_bug.cgi?id=1520824 was closed, I believe this work is complete.
backlog cleanup