Closed Bug 1109932 Opened 10 years ago Closed 9 years ago

Intermittent We have not been able to establish a telnet connection with the emulator

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: KWierso, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(4 files, 5 obsolete files)

Unsure how frequent this could be, but filing it to track it in case it's more than a one-off...


15:15:05 INFO - QEMU options list:
15:15:05 INFO - emulator: argv[00] = "/tools/android-sdk18/tools/emulator64-arm"
15:15:05 INFO - emulator: argv[01] = "-android-port"
15:15:05 INFO - emulator: argv[02] = "5554"
15:15:05 INFO - emulator: argv[03] = "-m"
15:15:05 INFO - emulator: argv[04] = "1024"
15:15:05 INFO - emulator: argv[05] = "-cpu"
15:15:05 INFO - emulator: argv[06] = "cortex-a9"
15:15:05 INFO - emulator: argv[07] = "-android-hw"
15:15:05 INFO - emulator: argv[08] = "/home/cltbld/.android/avd/test-1.avd/hardware-qemu.ini"
15:15:05 INFO - Concatenated QEMU options:
15:15:05 INFO - /tools/android-sdk18/tools/emulator64-arm -android-port 5554 -m 1024 -cpu cortex-a9 -android-hw /home/cltbld/.android/avd/test-1.avd/hardware-qemu.ini
15:15:05 INFO - ##### test-1 emulator log ends
15:15:05 INFO - Let's kill every process called emulator64-arm
15:15:05 INFO - Killing pid 1380.
15:15:05 FATAL - We have not been able to establish a telnet connection with the emulator
15:15:05 FATAL - Running post_fatal callback...
15:15:05 INFO - Let's kill every process called emulator64-arm
15:15:05 INFO - Killing pid 1380.
15:15:05 FATAL - Exiting -1
15:15:05 INFO - Running post-action listener: _resource_record_post_action
15:15:05 INFO - Running post-run listener: _resource_record_post_run
15:15:05 INFO - Running post-run listener: _upload_blobber_files
15:15:05 INFO - Blob upload gear active.
15:15:05 WARNING - Blob upload directory does not exist!
program finished with exit code 255
Guess we can safely guess that if there ever is an intermittent problem with the Firefox OS emulator, rather than a problem with qemu being used as an Android emulator, there'll be no point in us filing a bug about it, if we can go three weeks and 222 comments without anyone mentioning that we've put a Firefox for Android releng bug in their product/component.
Component: Emulator → Platform Support
OS: Windows 8.1 → Linux
Product: Firefox OS → Release Engineering
QA Contact: coop
Hardware: x86_64 → x86
Depends on: 1117395
This looks to be quite a frequent failure; this bug is #2 on this week's War on Orange report.

I am inclined to ignore the Android x86 failures -- they seem less frequent, and I have seen similar, infrequent failures on x86 since we started running x86.

But the Android 2.3 failures are troubling and seem like a "new" problem -- I'm pretty sure this was not happening a few months ago. Might this be a regression from:
 - split apk
 - locale builds
 - no reboot
?

Logs show that the emulator is started, but all attempts to telnet to the emulator control port (5554) are met with Connection Refused, as though the emulator cannot completely start up. That might be related to a change in:
 - emulator binary
 - emulator system images
 - avd definitions
 - android_emulator_unittest.py
 - the python telnet library
 - aws configuration
but I don't see any likely candidates in recent change logs.

Or maybe it's another host process using port 5554?

:kmoir -- any ideas?
Flags: needinfo?(kmoir)
This became a perma-fail problem today with recent logs showing "PANIC: Could not open: test-1". We think that's related to http://hg.mozilla.org/build/mozharness/rev/27e55b4b5c9a.
Attached patch more diagnostics (obsolete) — — Splinter Review
I'll push this to try and see if it tells me anything interesting.
The last time a similar issue happened it was a process that wasn't being killed.  I'm logged into a spot instance that ran into this issue with the emulator jobs recently but can't see any processes holding those ports open.  Hopefully gbrown's diagnostics will reveal the cause
Flags: needinfo?(kmoir)
I have added killing at the beginning of a run in here:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=011055026725

Since we stopped rebooting linux machines it is likely that we now see this issue.
With ps and netstat called just before each attempt to start the emulator: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b876bd03bea4. I hit bug 1118267 quite a lot, but still reproduced the emulator connect failure once: http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/gbrown@mozilla.com-b876bd03bea4/try-android-api-9/try_ubuntu64_vm_mobile_test-mochitest-11-bm115-tests1-linux64-build737.txt.gz

Before the first emulator launch, everything looks fine -- emulator is not running, no sign of anything listening on port 5554.

The first attempt fails, so we kill the emulator:

12:10:33  WARNING - failed to establish a telnet connection with the emulator
...
12:10:33     INFO - Let's kill every process called emulator64-arm
12:10:33     INFO - Killing pid 5317.

and try again...but that emulator still appears in the process list:

12:10:33     INFO -  5317 ?        00:02:08 emulator64-arm

On the next try, pid 5317 is gone but the pattern repeats with the new emulator...

12:13:13     INFO - Let's kill every process called emulator64-arm
12:13:14     INFO - Killing pid 17488.
...
12:13:14     INFO - 17488 ?        00:02:10 emulator64-arm

Am I just checking ps too soon after the kill, or is the kill not working?
I notice that the process list in the failure of Comment 789 is very different from the process list of successful runs. I have only reproduced the failure once, so this isn't conclusive, but it is curious.

Failure (Comment 789):
12:08:23     INFO -   PID TTY          TIME CMD
12:08:23     INFO -     1 ?        00:00:00 init
12:08:23     INFO -     2 ?        00:00:00 kthreadd
12:08:23     INFO -     3 ?        00:00:00 ksoftirqd/0
12:08:23     INFO -     4 ?        00:00:00 kworker/0:0
12:08:23     INFO -     5 ?        00:00:00 kworker/u:0
12:08:23     INFO -     6 ?        00:00:00 migration/0
12:08:23     INFO -     7 ?        00:00:00 watchdog/0
12:08:23     INFO -     8 ?        00:00:00 cpuset
12:08:23     INFO -     9 ?        00:00:00 khelper
12:08:23     INFO -    10 ?        00:00:00 kdevtmpfs
12:08:23     INFO -    11 ?        00:00:00 netns
12:08:23     INFO -    12 ?        00:00:00 xenwatch
12:08:23     INFO -    13 ?        00:00:00 xenbus
12:08:23     INFO -    14 ?        00:00:00 sync_supers
12:08:23     INFO -    15 ?        00:00:00 bdi-default
12:08:23     INFO -    16 ?        00:00:00 kintegrityd
12:08:23     INFO -    17 ?        00:00:00 kblockd
12:08:23     INFO -    18 ?        00:00:00 ata_sff
12:08:23     INFO -    19 ?        00:00:00 khubd
12:08:23     INFO -    20 ?        00:00:00 md
12:08:23     INFO -    21 ?        00:00:00 kworker/0:1
12:08:23     INFO -    22 ?        00:00:00 kworker/u:1
12:08:23     INFO -    24 ?        00:00:00 khungtaskd
12:08:23     INFO -    25 ?        00:00:00 kswapd0
12:08:23     INFO -    26 ?        00:00:00 ksmd
12:08:23     INFO -    27 ?        00:00:00 fsnotify_mark
12:08:23     INFO -    28 ?        00:00:00 ecryptfs-kthrea
12:08:23     INFO -    29 ?        00:00:00 crypto
12:08:23     INFO -    37 ?        00:00:00 kthrotld
12:08:23     INFO -    38 ?        00:00:00 khvcd
12:08:23     INFO -    57 ?        00:00:00 devfreq_wq
12:08:23     INFO -   150 ?        00:00:01 jbd2/xvda1-8
12:08:23     INFO -   151 ?        00:00:00 ext4-dio-unwrit
12:08:23     INFO -   238 ?        00:00:00 upstart-udev-br
12:08:23     INFO -   243 ?        00:00:00 udevd
12:08:23     INFO -   342 ?        00:00:00 udevd
12:08:23     INFO -   343 ?        00:00:00 udevd
12:08:23     INFO -   488 ?        00:00:00 upstart-socket-
12:08:23     INFO -   560 ?        00:00:00 dhclient3
12:08:23     INFO -   689 ?        00:00:00 sh
12:08:23     INFO -   715 ?        00:00:00 sshd
12:08:23     INFO -   719 ?        00:00:00 dbus-daemon
12:08:23     INFO -   734 ?        00:00:00 rsyslogd
12:08:23     INFO -   754 tty4     00:00:00 getty
12:08:23     INFO -   756 tty5     00:00:00 getty
12:08:23     INFO -   761 tty2     00:00:00 getty
12:08:23     INFO -   762 tty3     00:00:00 getty
12:08:23     INFO -   764 tty6     00:00:00 getty
12:08:23     INFO -   769 ?        00:00:00 cron
12:08:23     INFO -   812 ?        00:00:00 apache2
12:08:23     INFO -   814 ?        00:00:00 apache2
12:08:23     INFO -   816 ?        00:00:00 apache2
12:08:23     INFO -   817 ?        00:00:00 apache2
12:08:23     INFO -   890 ?        00:00:00 collectdmon
12:08:23     INFO -   892 ?        00:00:00 collectd
12:08:23     INFO -   990 tty1     00:00:00 getty
12:08:23     INFO -   995 ?        00:00:00 console-kit-dae
12:08:23     INFO -  1062 ?        00:00:00 polkitd
12:08:23     INFO -  1133 ?        00:00:00 upowerd
12:08:23     INFO -  1148 ?        00:00:00 accounts-daemon
12:08:23     INFO -  1212 ?        00:00:00 rtkit-daemon
12:08:23     INFO -  1251 ?        00:00:00 udisks-daemon
12:08:23     INFO -  1252 ?        00:00:00 udisks-daemon
12:08:23     INFO -  1485 ?        00:00:00 runner
12:08:23     INFO -  5237 ?        00:00:00 Xvfb
12:08:23     INFO -  5262 ?        00:00:00 python
12:08:23     INFO -  5275 ?        00:00:04 twistd
12:08:23     INFO -  5289 ?        00:00:00 flush-202:1
12:08:23     INFO -  5297 ?        00:00:01 python
12:08:23     INFO -  5315 ?        00:00:00 ps

Success:
http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/gbrown@mozilla.com-01d007b2ef6e/try-android-api-9/try_ubuntu64_vm_mobile_test-mochitest-3-bm52-tests1-linux64-build171.txt.gz
15:33:14     INFO -   PID TTY          TIME CMD
15:33:14     INFO -     1 ?        00:00:00 init
15:33:14     INFO -     2 ?        00:00:00 kthreadd
15:33:14     INFO -     3 ?        00:00:00 ksoftirqd/0
15:33:14     INFO -     4 ?        00:00:00 kworker/0:0
15:33:14     INFO -     5 ?        00:00:00 kworker/u:0
15:33:14     INFO -     6 ?        00:00:00 migration/0
15:33:14     INFO -     7 ?        00:00:00 watchdog/0
15:33:14     INFO -     8 ?        00:00:00 cpuset
15:33:14     INFO -     9 ?        00:00:00 khelper
15:33:14     INFO -    10 ?        00:00:00 kdevtmpfs
15:33:14     INFO -    11 ?        00:00:00 netns
15:33:14     INFO -    12 ?        00:00:00 xenwatch
15:33:14     INFO -    13 ?        00:00:00 xenbus
15:33:14     INFO -    14 ?        00:00:00 sync_supers
15:33:14     INFO -    15 ?        00:00:00 bdi-default
15:33:14     INFO -    16 ?        00:00:00 kintegrityd
15:33:14     INFO -    17 ?        00:00:00 kblockd
15:33:14     INFO -    18 ?        00:00:00 ata_sff
15:33:14     INFO -    19 ?        00:00:00 khubd
15:33:14     INFO -    20 ?        00:00:00 md
15:33:14     INFO -    21 ?        00:00:00 kworker/0:1
15:33:14     INFO -    22 ?        00:00:00 kworker/u:1
15:33:14     INFO -    24 ?        00:00:00 khungtaskd
15:33:14     INFO -    25 ?        00:00:01 kswapd0
15:33:14     INFO -    26 ?        00:00:00 ksmd
15:33:14     INFO -    27 ?        00:00:00 fsnotify_mark
15:33:14     INFO -    28 ?        00:00:00 ecryptfs-kthrea
15:33:14     INFO -    29 ?        00:00:00 crypto
15:33:14     INFO -    37 ?        00:00:00 kthrotld
15:33:14     INFO -    38 ?        00:00:00 khvcd
15:33:14     INFO -    57 ?        00:00:00 devfreq_wq
15:33:14     INFO -   150 ?        00:00:03 jbd2/xvda1-8
15:33:14     INFO -   151 ?        00:00:00 ext4-dio-unwrit
15:33:14     INFO -   245 ?        00:00:00 upstart-udev-br
15:33:14     INFO -   250 ?        00:00:00 udevd
15:33:14     INFO -   369 ?        00:00:00 udevd
15:33:14     INFO -   380 ?        00:00:00 udevd
15:33:14     INFO -   544 ?        00:00:00 upstart-socket-
15:33:14     INFO -   553 ?        00:00:00 dhclient3
15:33:14     INFO -   642 ?        00:00:07 flush-202:1
15:33:14     INFO -   650 ?        00:00:00 sh
15:33:14     INFO -   677 ?        00:00:00 sshd
15:33:14     INFO -   687 ?        00:00:00 rsyslogd
15:33:14     INFO -   696 ?        00:00:01 dbus-daemon
15:33:14     INFO -   718 tty4     00:00:00 getty
15:33:14     INFO -   720 tty5     00:00:00 getty
15:33:14     INFO -   725 tty2     00:00:00 getty
15:33:14     INFO -   726 tty3     00:00:00 getty
15:33:14     INFO -   728 tty6     00:00:00 getty
15:33:14     INFO -   733 ?        00:00:00 cron
15:33:14     INFO -   775 ?        00:00:00 apache2
15:33:14     INFO -   777 ?        00:00:00 apache2
15:33:14     INFO -   779 ?        00:00:00 apache2
15:33:14     INFO -   782 ?        00:00:00 apache2
15:33:14     INFO -   853 ?        00:00:00 collectdmon
15:33:14     INFO -   855 ?        00:00:00 collectd
15:33:14     INFO -   953 tty1     00:00:00 getty
15:33:14     INFO -   958 ?        00:00:00 console-kit-dae
15:33:14     INFO -  1025 ?        00:00:00 polkitd
15:33:14     INFO -  1096 ?        00:00:00 upowerd
15:33:14     INFO -  1120 ?        00:00:00 accounts-daemon
15:33:14     INFO -  1175 ?        00:00:00 rtkit-daemon
15:33:14     INFO -  1214 ?        00:00:00 udisks-daemon
15:33:14     INFO -  1215 ?        00:00:00 udisks-daemon
15:33:14     INFO -  1441 ?        00:00:01 runner
15:33:14     INFO -  8167 ?        00:00:01 Xvfb
15:33:14     INFO -  8168 ?        00:00:00 sh
15:33:14     INFO -  8188 ?        00:00:00 python
15:33:14     INFO -  8201 ?        00:00:04 twistd
15:33:14     INFO -  8203 ?        00:00:00 su
15:33:14     INFO -  8210 ?        00:00:00 x-session-manag
15:33:14     INFO -  8236 ?        00:00:00 ssh-agent
15:33:14     INFO -  8239 ?        00:00:00 dbus-launch
15:33:14     INFO -  8240 ?        00:00:00 dbus-daemon
15:33:14     INFO -  8249 ?        00:00:00 gnome-keyring-d
15:33:14     INFO -  8250 ?        00:00:00 gnome-settings-
15:33:14     INFO -  8261 ?        00:00:00 gvfsd
15:33:14     INFO -  8263 ?        00:00:00 gvfs-fuse-daemo
15:33:14     INFO -  8269 ?        00:00:00 metacity
15:33:14     INFO -  8275 ?        00:00:00 gconfd-2
15:33:14     INFO -  8280 ?        00:00:00 unity-2d-panel
15:33:14     INFO -  8281 ?        00:00:01 unity-2d-shell
15:33:14     INFO -  8284 ?        00:00:00 pulseaudio
15:33:14     INFO -  8287 ?        00:00:00 gconf-helper
15:33:14     INFO -  8292 ?        00:00:00 nm-applet
15:33:14     INFO -  8293 ?        00:00:00 nautilus
15:33:14     INFO -  8294 ?        00:00:00 deja-dup-monito
15:33:14     INFO -  8295 ?        00:00:00 gnome-fallback-
15:33:14     INFO -  8296 ?        00:00:00 polkit-gnome-au
15:33:14     INFO -  8300 ?        00:00:00 bamfdaemon
15:33:14     INFO -  8304 ?        00:00:00 bluetooth-apple
15:33:14     INFO -  8322 ?        00:00:00 gvfs-gdu-volume
15:33:14     INFO -  8324 ?        00:00:00 gvfs-afc-volume
15:33:14     INFO -  8327 ?        00:00:00 gvfs-gphoto2-vo
15:33:14     INFO -  8332 ?        00:00:00 unity-panel-ser
15:33:14     INFO -  8340 ?        00:00:00 gvfsd-trash
15:33:14     INFO -  8343 ?        00:00:00 indicator-appli
15:33:14     INFO -  8344 ?        00:00:00 indicator-messa
15:33:14     INFO -  8346 ?        00:00:00 indicator-datet
15:33:14     INFO -  8348 ?        00:00:00 indicator-sessi
15:33:14     INFO -  8351 ?        00:00:00 indicator-sound
15:33:14     INFO -  8352 ?        00:00:00 indicator-print
15:33:14     INFO -  8359 ?        00:00:00 dconf-service
15:33:14     INFO -  8392 ?        00:00:00 geoclue-master
15:33:14     INFO -  8395 ?        00:00:00 gvfsd-burn
15:33:14     INFO -  8402 ?        00:00:00 ubuntu-geoip-pr
15:33:14     INFO -  8419 ?        00:00:00 hud-service
15:33:14     INFO -  8421 ?        00:00:00 unity-applicati
15:33:14     INFO -  8423 ?        00:00:00 unity-files-dae
15:33:14     INFO -  8426 ?        00:00:00 unity-music-dae
15:33:14     INFO -  8428 ?        00:00:00 unity-lens-vide
15:33:14     INFO -  8456 ?        00:00:00 zeitgeist-daemo
15:33:14     INFO -  8472 ?        00:00:00 zeitgeist-fts
15:33:14     INFO -  8474 ?        00:00:00 zeitgeist-datah
15:33:14     INFO -  8480 ?        00:00:00 cat
15:33:14     INFO -  8510 ?        00:00:00 unity-musicstor
15:33:14     INFO -  8511 ?        00:00:00 unity-scope-vid
15:33:14     INFO -  8541 ?        00:00:00 gdu-notificatio
15:33:14     INFO -  8544 ?        00:00:00 telepathy-indic
15:33:14     INFO -  8551 ?        00:00:00 mission-control
15:33:14     INFO -  8556 ?        00:00:00 goa-daemon
15:33:14     INFO -  8560 ?        00:00:00 gnome-screensav
15:33:14     INFO -  8575 ?        00:00:00 update-notifier
15:33:14     INFO -  8585 ?        00:00:02 aptd
15:33:14     INFO -  8601 ?        00:00:04 python
15:33:14     INFO -  8607 ?        00:00:00 ps

The failed host is not running pulseaudio, not running gnome-screensaver, not running any unity processes.
This seems to be happening less frequently, but I did catch one more failure with diagnostics:

http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/gbrown@mozilla.com-e35a459befc5/try-android-api-9/try_ubuntu64_vm_mobile_test-mochitest-16-bm68-tests1-linux64-build517.txt.gz

Again, the initial state of the host is suspect, with pulseaudio and various daemons not running:

22:00:42     INFO - Attempt #1 to launch emulators...
22:00:42     INFO - output from ps -A follows...
22:00:42     INFO -   PID TTY          TIME CMD
22:00:42     INFO -     1 ?        00:00:00 init
22:00:42     INFO -     2 ?        00:00:00 kthreadd
22:00:42     INFO -     3 ?        00:00:00 ksoftirqd/0
22:00:42     INFO -     4 ?        00:00:00 kworker/0:0
22:00:42     INFO -     5 ?        00:00:00 kworker/u:0
22:00:42     INFO -     6 ?        00:00:00 migration/0
22:00:42     INFO -     7 ?        00:00:00 watchdog/0
22:00:42     INFO -     8 ?        00:00:00 cpuset
22:00:42     INFO -     9 ?        00:00:00 khelper
22:00:42     INFO -    10 ?        00:00:00 kdevtmpfs
22:00:42     INFO -    11 ?        00:00:00 netns
22:00:42     INFO -    12 ?        00:00:00 xenwatch
22:00:42     INFO -    13 ?        00:00:00 xenbus
22:00:42     INFO -    14 ?        00:00:00 sync_supers
22:00:42     INFO -    15 ?        00:00:00 bdi-default
22:00:42     INFO -    16 ?        00:00:00 kintegrityd
22:00:42     INFO -    17 ?        00:00:00 kblockd
22:00:42     INFO -    18 ?        00:00:00 ata_sff
22:00:42     INFO -    19 ?        00:00:00 khubd
22:00:42     INFO -    20 ?        00:00:00 md
22:00:42     INFO -    21 ?        00:00:00 kworker/0:1
22:00:42     INFO -    22 ?        00:00:00 kworker/u:1
22:00:42     INFO -    24 ?        00:00:00 khungtaskd
22:00:42     INFO -    25 ?        00:00:02 kswapd0
22:00:42     INFO -    26 ?        00:00:00 ksmd
22:00:42     INFO -    27 ?        00:00:00 fsnotify_mark
22:00:42     INFO -    28 ?        00:00:00 ecryptfs-kthrea
22:00:42     INFO -    29 ?        00:00:00 crypto
22:00:42     INFO -    37 ?        00:00:00 kthrotld
22:00:42     INFO -    38 ?        00:00:00 khvcd
22:00:42     INFO -    57 ?        00:00:00 devfreq_wq
22:00:42     INFO -   150 ?        00:00:05 flush-202:1
22:00:42     INFO -   151 ?        00:00:02 jbd2/xvda1-8
22:00:42     INFO -   152 ?        00:00:00 ext4-dio-unwrit
22:00:42     INFO -   234 ?        00:00:00 upstart-udev-br
22:00:42     INFO -   237 ?        00:00:00 udevd
22:00:42     INFO -   378 ?        00:00:00 udevd
22:00:42     INFO -   379 ?        00:00:00 udevd
22:00:42     INFO -   504 ?        00:00:00 upstart-socket-
22:00:42     INFO -   578 ?        00:00:00 dhclient3
22:00:42     INFO -   690 ?        00:00:00 sh
22:00:42     INFO -   736 ?        00:00:00 rsyslogd
22:00:42     INFO -   738 ?        00:00:00 dbus-daemon
22:00:42     INFO -   752 tty4     00:00:00 getty
22:00:42     INFO -   754 tty5     00:00:00 getty
22:00:42     INFO -   759 tty2     00:00:00 getty
22:00:42     INFO -   760 tty3     00:00:00 getty
22:00:42     INFO -   762 tty6     00:00:00 getty
22:00:42     INFO -   764 ?        00:00:00 cron
22:00:42     INFO -   815 ?        00:00:00 apache2
22:00:42     INFO -   816 ?        00:00:00 apache2
22:00:42     INFO -   819 ?        00:00:00 apache2
22:00:42     INFO -   821 ?        00:00:00 apache2
22:00:42     INFO -   896 ?        00:00:00 collectdmon
22:00:42     INFO -   899 ?        00:00:00 collectd
22:00:42     INFO -  1047 ?        00:00:00 sshd
22:00:42     INFO -  1058 tty1     00:00:00 getty
22:00:42     INFO -  1074 ?        00:00:00 console-kit-dae
22:00:42     INFO -  1141 ?        00:00:00 polkitd
22:00:42     INFO -  1212 ?        00:00:00 upowerd
22:00:42     INFO -  1223 ?        00:00:00 accounts-daemon
22:00:42     INFO -  1283 ?        00:00:00 rtkit-daemon
22:00:42     INFO -  1328 ?        00:00:00 udisks-daemon
22:00:42     INFO -  1329 ?        00:00:00 udisks-daemon
22:00:42     INFO -  1429 ?        00:00:00 runner
22:00:42     INFO -  7063 ?        00:00:00 Xvfb
22:00:42     INFO -  7089 ?        00:00:00 python
22:00:42     INFO -  7102 ?        00:00:04 twistd
22:00:42     INFO -  7123 ?        00:00:04 python
22:00:42     INFO -  7129 ?        00:00:00 ps
22:00:42     INFO - output from netstat -a -p -n -t -u follows...
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
22:00:42     INFO - Active Internet connections (servers and established)
22:00:42     INFO - Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
22:00:42     INFO - tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      -               
22:00:42     INFO - tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -               
22:00:42     INFO - tcp        0      0 10.132.157.152:46164    10.132.50.89:9201       ESTABLISHED 7102/python2.7  
22:00:42     INFO - tcp        0      0 10.132.157.152:43778    10.22.75.39:2003        ESTABLISHED -               
22:00:42     INFO - tcp6       0      0 :::22                   :::*                    LISTEN      -               
22:00:42     INFO - udp        0      0 0.0.0.0:68              0.0.0.0:*                           -      

I expect the emulator cannot fully start with the aws host in this state.
I think these diagnostics may come in handy over time and this seems like a good opportunity to add them.
Attachment #8544745 - Attachment is obsolete: true
Attachment #8545278 - Flags: review?(armenzg)
These processes are typically running on a "good" aws instance but are missing from a "bad" one:

< aptd
< bamfdaemon
< bluetooth-apple
< cat
< dbus-daemon
< dbus-launch
< dconf-service
< deja-dup-monito
< gconfd-2
< gconf-helper
< gdu-notificatio
< geoclue-master
< gnome-fallback-
< gnome-keyring-d
< gnome-screensav
< gnome-settings-
< goa-daemon
< gvfs-afc-volume
< gvfsd
< gvfsd-burn
< gvfsd-trash
< gvfs-fuse-daemo
< gvfs-gdu-volume
< gvfs-gphoto2-vo
< hud-service
< indicator-appli
< indicator-datet
< indicator-messa
< indicator-print
< indicator-sessi
< indicator-sound
< metacity
< mission-control
< nautilus
< nm-applet
< polkit-gnome-au
< pulseaudio
< sh
< ssh-agent
< su
< telepathy-indic
< ubuntu-geoip-pr
< unity-2d-panel
< unity-2d-shell
< unity-applicati
< unity-files-dae
< unity-lens-vide
< unity-music-dae
< unity-musicstor
< unity-panel-ser
< unity-scope-vid
< update-notifier
< x-session-manag
< zeitgeist-daemo
< zeitgeist-datah
< zeitgeist-fts
Comment on attachment 8545278 [details] [diff] [review]
dump process list and netstat output from android_emulator_unittest

Review of attachment 8545278 [details] [diff] [review]:
-----------------------------------------------------------------

With regards to pulseaudio, should we add a check?

Is there a way that we can start it?
Attachment #8545278 - Flags: review?(armenzg) → review+
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #826)
> With regards to pulseaudio, should we add a check?

We could check, but I don't think we should. Really pulseaudio is just one of several things that are necessary for the emulator to run our tests, and I don't think we know all the dependencies. The bigger, more general issue is "the emulator won't start".

> Is there a way that we can start it?

I expect we could start it -- but we should not need to, at least in a mozharness script. If it's not running, I'd prefer we have a bug like this that draws our attention to the question of why some processes are not running.
We run a tool called 'cleanslate' between jobs. This tool creates a snapshot of processes running before buildbot starts, and then kills any extra processes that are left over after buildbot stops.

We also restart Xvfb between jobs.

Either of these could be responsible for pulseaudio not running. How is it started right now?
Except that if I understood her correctly, when I blamed mrrgn for this whole thing (because, again, this is not an Android thing at all, not in any way, this is just how Android looks when run on a slave in the bug 1114541 state), she said that no-reboot is not enabled for this slave pool, and that cleanslate does nothing on the first run, so as long as we're not doing no-reboot it's not in the picture.

We did, however, switch from doing reboots as a buildbot step to doing them with runner. When did we do that?
Diagnostic patch: https://hg.mozilla.org/build/mozharness/rev/3b9076325fff

I changed "ps -A" to "ps -ef" -- requested by :catlee.
We are starting to see the new diagnostics now and they are consistent with my observations in Comment 821 and Comment 823. 

ps -ef gives a few more clues. Consider these processes which normally run but are missing from the failures here:

http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-api-9/1420819059/mozilla-inbound_ubuntu64_vm_mobile_test-mochitest-2-bm51-tests1-linux64-build520.txt.gz

09:26:45     INFO - cltbld    7291  7257  0 09:10 ?        00:00:00 su - -c DISPLAY=:0 /etc/X11/Xsession cltbld
09:26:45     INFO - cltbld    7298  7291  0 09:10 ?        00:00:00 x-session-manager
09:26:45     INFO - cltbld    7324  7298  0 09:10 ?        00:00:00 /usr/bin/ssh-agent /usr/bin/dbus-launch --exit-with-session x-session-manager

x-session-manager (7298) is the parent of many of the missing processes. 7291 starts x-session-manager. 7291's command line looks like http://hg.mozilla.org/build/puppet/annotate/d7eaf090016f/modules/gui/templates/Xsession.conf.erb#l14.
I kind of suspect this: http://hg.mozilla.org/build/puppet/rev/5a4fa0e0da97

We restart Xvfb between jobs. Perhaps not enough time is being given for the X session to come up?
:mrrrgn -- Do you know what's happening here? See comments 922, 923. We have intermittent failures on Android 2.3 (emulator) test jobs. ps -ef shows that, in the failure cases, many daemons such as x-session-manager are not running. Xvfb is running though.
Flags: needinfo?(winter2718)
Attached patch emulatorbug-noreboot-test.diff — — Splinter Review
Let's remove these variables from the situation.
Flags: needinfo?(winter2718)
Attachment #8548426 - Flags: review?(bugspam.Callek)
(In reply to Geoff Brown [:gbrown] from comment #1009)
> :mrrrgn -- Do you know what's happening here? See comments 922, 923. We have
> intermittent failures on Android 2.3 (emulator) test jobs. ps -ef shows
> that, in the failure cases, many daemons such as x-session-manager are not
> running. Xvfb is running though.

I've tried looking at cleanslate as a possible culprit. On busted hosts here's what it's killing: 2015-01-13 11:21:57,168 - DEBUG - Adding pid:2014 cmd:'update-notifier' to kill set.
2015-01-13 11:21:57,169 - DEBUG - Adding pid:2196 cmd:'adb fork-server server' to kill set.
2015-01-13 11:21:57,169 - DEBUG - Adding pid:2445 cmd:'/usr/lib/gvfs/gvfsd-metadata' to kill set.
2015-01-13 09:41:24,479 - DEBUG - Adding pid:2622 cmd:'/usr/lib/notify-osd/notify-osd' to kill set.
2015-01-13 09:41:24,480 - DEBUG - Adding pid:1751 cmd:'/usr/bin/gnome-keyring-daemon --start --components=secrets' to kill set.
2015-01-13 09:41:24,480 - DEBUG - Adding pid:2075 cmd:'update-notifier' to kill set.


I've just put in a patch to turn off reboots and cleanslate to see if that impacts the situation. As it stands, there are still too many variables for me to make a clear call about what is happening.
Comment on attachment 8548426 [details] [diff] [review]
emulatorbug-noreboot-test.diff

Review of attachment 8548426 [details] [diff] [review]:
-----------------------------------------------------------------

Doubt we need those OSX flavors in this array with this patch, but won't hurt
Attachment #8548426 - Flags: review?(bugspam.Callek) → review+
Attached patch xvfb.restart.diff (obsolete) — — Splinter Review
Attachment #8549100 - Flags: review?(bugspam.Callek)
Comment on attachment 8549100 [details] [diff] [review]
xvfb.restart.diff

Review of attachment 8549100 [details] [diff] [review]:
-----------------------------------------------------------------

I know in IRC you said you're moving to a post-task, but as a why-not let me r+ this as well right now.
Attachment #8549100 - Flags: review?(bugspam.Callek) → review+
Attached patch xvfb.restart.diff — — Splinter Review
Attachment #8549100 - Attachment is obsolete: true
Attachment #8549116 - Flags: review?(bugspam.Callek)
Attachment #8549116 - Flags: review?(bugspam.Callek) → review+
Attached patch runlevel.xvfb.diff (obsolete) — — Splinter Review
I forgot to remove the old file on the last patch. However, I already removed the old file from all of our existing instances via Ansible (and it didn't seem to stop the errors). This may not be a runner issue at all.
Attachment #8549330 - Flags: review?(bugspam.Callek)
Attachment #8549330 - Flags: review?(bugspam.Callek) → review+
Attached patch xvfb.again.diff (obsolete) — — Splinter Review
Attachment #8549330 - Attachment is obsolete: true
Attachment #8549712 - Flags: review?(bugspam.Callek)
Attached patch xvfb.again.diff (obsolete) — — Splinter Review
Attachment #8549712 - Attachment is obsolete: true
Attachment #8549712 - Flags: review?(bugspam.Callek)
Attachment #8549721 - Flags: review?(bugspam.Callek)
Comment on attachment 8549721 [details] [diff] [review]
xvfb.again.diff

Review of attachment 8549721 [details] [diff] [review]:
-----------------------------------------------------------------

err... forgot to submit my review :/

::: modules/toplevel/manifests/slave/releng/test.pp
@@ -18,5 @@
>      case $::operatingsystem {
>          "Ubuntu": {
>              include runner::tasks::update_shared_repos
>              include runner::tasks::checkout_tools
> -            include runner::tasks::restart_services

With this, you won't be actually doing that ensure => absent above...

I just suggest "" instead of "xvfb" in the above file for now, rather than remove this include.
Attachment #8549721 - Flags: review?(bugspam.Callek) → review-
Attached patch xvfb.again.diff — — Splinter Review
Attachment #8549721 - Attachment is obsolete: true
Attachment #8549752 - Flags: review?(bugspam.Callek)
Attachment #8549752 - Flags: review?(bugspam.Callek) → review+
I'm optimistically calling this fixed with the xvfb changes mrrrgn completed.

\o/
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
A few additional failures can be expected over the course of the next two days as I try to understand what's happening at a deeper level.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'm marking this fixed (for now) based on the related bugs philor pointed here that are also fixed, and this trend graph (with the last for this bug specifically being last week):

 'times': {'2014-12-10': 1,
           '2014-12-15': 1,
           '2014-12-18': 1,
           '2014-12-19': 1,
           '2014-12-20': 8,
           '2014-12-21': 10,
           '2014-12-22': 34,
           '2014-12-23': 24,
           '2014-12-24': 14,
           '2014-12-25': 9,
           '2014-12-26': 18,
           '2014-12-27': 11,
           '2014-12-28': 9,
           '2014-12-29': 33,
           '2014-12-30': 17,
           '2014-12-31': 19,
           '2015-01-01': 14,
           '2015-01-02': 27,
           '2015-01-03': 16,
           '2015-01-04': 15,
           '2015-01-05': 43,
           '2015-01-06': 468,
           '2015-01-07': 42,
           '2015-01-08': 42,
           '2015-01-09': 37,
           '2015-01-10': 11,
           '2015-01-11': 15,
           '2015-01-12': 33,
           '2015-01-13': 28,
           '2015-01-14': 41,
           '2015-01-15': 53,
           '2015-01-16': 29,
           '2015-01-17': 14,
           '2015-01-18': 7,
           '2015-01-19': 2,
           '2015-01-22': 7,
           '2015-01-23': 1,
           '2015-01-24': 1}}
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Thank you very much!

I'm having a bit of trouble understanding the patches.
What did we actually do?

IIUC wwe resumed rebooting, then moved xvfb to be one of the last things we restart by lowering the level, then removed it completely and marked those two restarting services files to be removed.

It seems that we're now back to no-rebooting:
http://hg.mozilla.org/build/puppet/file/e6c1ba753411/modules/runner/files/post_flight.py#l38

Do we still need the removal of the two files? I assume that maybe for some time.
http://hg.mozilla.org/build/puppet/file/e6c1ba753411/modules/runner/manifests/tasks/restart_services.pp
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: