Closed Bug 949740 Opened 6 years ago Closed 6 years ago

Intermittent Android x86 We have not been able to establish a telnet connection with the emulator

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set

Tracking

(firefox27 unaffected, firefox28 unaffected, firefox29 fixed, firefox-esr24 unaffected)

RESOLVED FIXED
Tracking Status
firefox27 --- unaffected
firefox28 --- unaffected
firefox29 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files, 1 obsolete file)

Consider: 

https://tbpl.mozilla.org/?tree=Cedar&showall=1&rev=5dc615915eb6

There are 3 instances of this error in 22 runs of S4.

We can probably live with that, but I wonder if we could do better.
15:15:30     INFO - #####
15:15:30     INFO - ##### Running start-emulators step.
15:15:30     INFO - #####
15:15:30     INFO - Running pre-action listener: _resource_record_pre_action
15:15:30     INFO - Running main action method: start_emulators
15:15:30     INFO - Let's kill every process called compiz
15:15:30     INFO - Killing pid 2393.
15:15:30     INFO - Created temp file /tmp/tmpzo7j_0.
15:15:30     INFO - Trying to start the emulator with this command: emulator -avd test-1 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5554 -qemu -m 1024 -enable-kvm
15:15:30     INFO - Sleeping 10 seconds
15:15:40     INFO -   Attempt #1 to redirect ports: (5554, 20701, 20700)
15:15:40     INFO - Trying again after exception: [Errno 111] Connection refused
15:15:40     INFO - Sleeping 30 seconds
15:16:10     INFO -   Attempt #2 to redirect ports: (5554, 20701, 20700)
15:16:10     INFO - Trying again after exception: [Errno 111] Connection refused
15:16:10     INFO - Sleeping 30 seconds
15:16:40     INFO -   Attempt #3 to redirect ports: (5554, 20701, 20700)
15:16:40     INFO - Trying again after exception: [Errno 111] Connection refused
15:16:40     INFO - Sleeping 30 seconds
15:17:10     INFO -   Attempt #4 to redirect ports: (5554, 20701, 20700)
15:17:10     INFO - Trying again after exception: [Errno 111] Connection refused
15:17:10     INFO - Sleeping 30 seconds
15:17:40     INFO -   Attempt #5 to redirect ports: (5554, 20701, 20700)
15:17:40     INFO - Trying again after exception: [Errno 111] Connection refused
15:17:40    FATAL - We have not been able to establish a telnet connection with the emulator
15:17:40    FATAL - Running post_fatal callback...
15:17:40     INFO - Let's kill every process called emulator64-x86
15:17:40     INFO - Killing pid 2718.
15:17:40    FATAL - Exiting -1
In comparison, this seems typical on success:

09:23:19     INFO - #####
09:23:19     INFO - ##### Running start-emulators step.
09:23:19     INFO - #####
09:23:19     INFO - Running pre-action listener: _resource_record_pre_action
09:23:19     INFO - Running main action method: start_emulators
09:23:19     INFO - Let's kill every process called compiz
09:23:19     INFO - Killing pid 2415.
09:23:19     INFO - Created temp file /tmp/tmpgzGPNu.
09:23:19     INFO - Trying to start the emulator with this command: emulator -avd test-1 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5554 -qemu -m 1024 -enable-kvm
09:23:19     INFO - Sleeping 10 seconds
09:23:29     INFO -   Attempt #1 to redirect ports: (5554, 20701, 20700)
09:23:29     INFO - test-1: 5554; sut port: 20701/20700
09:23:29     INFO - Created temp file /tmp/tmpXjYJmL.
09:23:29     INFO - Trying to start the emulator with this command: emulator -avd test-2 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5556 -qemu -m 1024 -enable-kvm
09:23:29     INFO - Sleeping 10 seconds
09:23:39     INFO -   Attempt #1 to redirect ports: (5556, 20703, 20702)
09:23:39     INFO - test-2: 5556; sut port: 20703/20702
09:23:39     INFO - Created temp file /tmp/tmpqkRu5X.
09:23:39     INFO - Trying to start the emulator with this command: emulator -avd test-3 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5558 -qemu -m 1024 -enable-kvm
09:23:39     INFO - Sleeping 10 seconds
09:23:49     INFO -   Attempt #1 to redirect ports: (5558, 20705, 20704)
09:23:49     INFO - test-3: 5558; sut port: 20705/20704
09:23:49     INFO - Created temp file /tmp/tmp49YwFe.
09:23:49     INFO - Trying to start the emulator with this command: emulator -avd test-4 -debug init,console,gles,memcheck,adbserver,adbclient,adb,avd_config,socket -port 5560 -qemu -m 1024 -enable-kvm
09:23:49     INFO - Sleeping 10 seconds
09:23:59     INFO -   Attempt #1 to redirect ports: (5560, 20707, 20706)
09:23:59     INFO - test-4: 5560; sut port: 20707/20706
Blocks: 891959
I wonder if the emulator log would show anything interesting in this case: let's dump it.
Attachment #8347499 - Flags: review?(armenzg)
Attachment #8347499 - Flags: review?(armenzg) → review+
That first patch didn't quite work: https://tbpl.mozilla.org/php/getParsedLog.php?id=32032518&tree=Cedar&full=1#error1

That's because emulator_procs has not been populated yet. A little re-arranging of the logic should get this working.
Attachment #8348234 - Flags: review?(armenzg)
oops - attached the wrong patch
Attachment #8348234 - Attachment is obsolete: true
Attachment #8348234 - Flags: review?(armenzg)
Attachment #8348239 - Flags: review?(armenzg)
Attachment #8347499 - Attachment description: dump emulator log if _redirectSUT fails → (1) dump emulator log if _redirectSUT fails
Attachment #8348239 - Flags: review?(armenzg) → review+
The failed emulator log (comment 8) shows:

...                 ioctl(KVM_CREATE_VM) failed: Interrupted system call
12:32:10     INFO - ko:failed to initialize KVM
in production
Blocks: 936226
No longer blocks: 891959
I checked the permissions on /dev/kvm; for both the passing and failing cases, permissions are the same:

crw-rw----+ 1 root kvm 10, 232 Dec 18 14:05 /dev/kvm
Delaying the launch of the emulators can virtually eliminate this error. Normally I am seeing the emulator launch fail about 1 time in 10. If we sleep for 3 minutes before trying to launch the emulators, this still happens, but only ~1 time in 100. If the sleep is reduced to 1 minute, I see about 1 failure in 50. Below 1 minute, failures increase significantly.
In light of comment 13, I thought I would try retrying: If an emulator is launched but cannot be contacted for port redirection, kill any running emulators, wait for a while, and then try launching the emulators again. A pause of 30 seconds between retries and up to 3 retries seems to effectively resolve this bug.

Here is an example where we fail (with the ioctl(KVM_CREATE_VM) message), retry, and succeed:

https://tbpl.mozilla.org/php/getParsedLog.php?id=32254143&tree=Ash&full=1
Attachment #8350759 - Flags: review?(armenzg)
Comment on attachment 8350759 [details] [diff] [review]
(3) retry as described in Comment 14

Review of attachment 8350759 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/android_emulator_unittest.py
@@ +471,5 @@
>          self._kill_processes("compiz")
> +
> +        attempts = 0
> +        redirect_failed = True
> +        while attempts < 3 and redirect_failed:

Mind adding a comment on this while block to help future readers know what we're doing.
Attachment #8350759 - Flags: review?(armenzg) → review+
Merged mozharness.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
See Also: → 1109932
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.