Closed Bug 1520837 Opened 6 years ago Closed 3 years ago

i3.metal worker for deepspeech taskcluster

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dhouse, Assigned: dhouse)

Deepspeech tests on m5d.2xlarge instances are unreliable. Without KVM, the QEMU-based Android emulator is slow, possibly slow enough in some cases that tests cannot complete.

In #taskcluster, :gerard-majax requested that we try i3.metal to see whether KVM is available.

With the transition of worker management, RelOps was asked to do the setup. But we're not able to access the taskcluster aws account to create a new AMI for the worker type.

  1. I created a new workerType through the taskcluster tools UI, reusing the deepspeech-worker AMIs (created by pmoore?) but on the i3.metal instance type. This partly works: the container can see that the CPUs have virtualization enabled, but /dev/kvm is not exposed to the container. Docker-worker may need to be changed to allow mounting a device or running as privileged so that the contained qemu (android emu) can use kvm. The workerType is set up with 0-5 spot instances and a max cost of $4.50. The cost has been stable around $1.50 over the last 3 months.
    Bug 1520824, filed with taskcluster, covers gaining access to /dev/kvm from the container.

  2. The i3.metal instances have 36 physical cores (72 vCPUs), so we could run more than one worker at a time on them.
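
The symptom in item 1 (virtualization flags visible, but no /dev/kvm inside the container) can be checked from within a task. The following is a minimal Python sketch, not part of the actual worker setup. The function names are hypothetical, and it assumes a Linux environment where /proc/cpuinfo lists the `vmx` (Intel) or `svm` (AMD) flag when hardware virtualization is visible.

```python
import os
import stat


def cpu_virt_flags(cpuinfo_text):
    """Return the subset of {'vmx', 'svm'} found on 'flags' lines of cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found


def kvm_device_usable(path="/dev/kvm"):
    """True if path is a character device this process can read and write."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing device node, which is the case reported in this bug.
        return False
    return stat.S_ISCHR(mode) and os.access(path, os.R_OK | os.W_OK)


def diagnose(cpuinfo_text, kvm_path="/dev/kvm"):
    """Return a one-line description of the KVM situation (hypothetical wording)."""
    flags = cpu_virt_flags(cpuinfo_text)
    if not flags:
        return "no vmx/svm flag: hardware virtualization is not visible"
    if not kvm_device_usable(kvm_path):
        return "cpu flags present (%s) but %s is not usable" % (
            ", ".join(sorted(flags)), kvm_path)
    return "kvm looks usable"
```

Inside a container, calling `diagnose(open("/proc/cpuinfo").read())` would be expected to report the second case when the CPU flags are visible but the device has not been passed through.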

Depends on: 1520824
Summary: kvm for deepspeech taskcluster → i3.metal worker for deepspeech taskcluster

https://tools.taskcluster.net/aws-provisioner/deepspeech-kvm-worker

Example task with errors finding kvm https://taskcluster-artifacts.net/fr-SDmzCQrSmlTPb-8hGIQ/0/public/logs/live_backing.log

emulator: CPU Acceleration: DISABLED
emulator: CPU Acceleration status: /dev/kvm is not found: VT disabled in BIOS or KVM kernel module not loaded
emulator: ERROR: x86_64 emulation currently requires hardware acceleration!
Please ensure KVM is properly installed and usable.
CPU acceleration status: /dev/kvm is not found: VT disabled in BIOS or KVM kernel module not loaded
* daemon started successfully
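
To spot this failure automatically in a task log, the two status lines above can be extracted with a few lines of Python. This is a sketch under the assumption that the emulator keeps printing the line formats shown in this excerpt; `summarize_acceleration` is a hypothetical helper name, not an existing tool.

```python
def summarize_acceleration(log_text):
    """Return (state, reason) from Android emulator log text; either may be None."""
    state_prefix = "emulator: CPU Acceleration: "
    # Matches both "CPU Acceleration status:" and "CPU acceleration status:".
    reason_marker = "cceleration status: "
    state = None
    reason = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if state is None and line.startswith(state_prefix):
            state = line[len(state_prefix):]
        elif reason is None and reason_marker in line:
            reason = line.split(reason_marker, 1)[1]
    return state, reason
```

A task wrapper could treat a state of `DISABLED` as an early signal that the run will be slow or will fail outright for x86_64 images.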

Test of accessing kvm from inside a container (showing valid kvm access when privileged or with /dev/kvm mounted but with some warnings because I tested on an older cpu):

[david@george docker]$ docker run --rm -ti kvmtest 
Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize KVM: No such file or directory
[david@george docker]$ docker run --privileged --rm -ti kvmtest 
QEMU 2.11.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
info kvm
kvm support: enabled
(qemu) qemu-system-x86_64: terminating on signal 15 from pid 8 (timeout)
[david@george docker]$ docker run --device /dev/kvm:/dev/kvm --rm -ti kvmtest 
QEMU 2.11.1 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
info kvm
kvm support: enabled
(qemu) qemu-system-x86_64: terminating on signal 15 from pid 8 (timeout)
[david@george docker]$ cat Dockerfile 
FROM ubuntu:18.04

RUN apt-get -qq update \
    && apt-get -qq -y install qemu-kvm

CMD echo "info kvm" | timeout 5.0s qemu-system-x86_64 --monitor stdio --enable-kvm -vnc localhost:55000
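
The three invocations tested above differ only in how KVM is exposed to the container. As a rough illustration, the sketch below builds the same `docker run` argument lists in Python. `docker_run_args` and its `kvm_access` parameter are hypothetical names, and the image name is whatever the Dockerfile above was built as.

```python
def docker_run_args(image, kvm_access="none"):
    """Build the argv for one of the three test runs shown in the transcript.

    kvm_access: "none" (no KVM), "privileged" (--privileged),
    or "device" (pass through only /dev/kvm).
    """
    args = ["docker", "run", "--rm", "-ti"]
    if kvm_access == "privileged":
        args.append("--privileged")
    elif kvm_access == "device":
        args += ["--device", "/dev/kvm:/dev/kvm"]
    elif kvm_access != "none":
        raise ValueError("unknown kvm_access: %r" % (kvm_access,))
    args.append(image)
    return args
```

On a host with Docker installed and a terminal attached, the result can be passed to `subprocess.run`. The `device` mode exposes only /dev/kvm rather than granting the container full privileges, which is likely the narrower change to ask of docker-worker.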

A subset of the deepspeech tests is completing as of this morning, and it was confirmed on IRC that this is acceptable for now while taskcluster works on bug 1520824 to provide kvm to the containers.

Is this work still needed?

Flags: needinfo?(dhouse)

Based on the activity in bug 1520824, this is still requested but not yet implemented in the docker-worker containers.

Flags: needinfo?(dhouse)

Nested virtualization was enabled on GCP, as explained in https://cloud.google.com/compute/docs/instances/enable-nested-virtualization-vm-instances
From https://bugzilla.mozilla.org/show_bug.cgi?id=1520824:
the change to monopacker for docker-worker was merged on Feb 12th: https://github.com/taskcluster/monopacker/pull/42

Because the bug https://bugzilla.mozilla.org/show_bug.cgi?id=1520824 was closed, I think this work is completed.

backlog cleanup

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX