Closed Bug 1157948 Opened 10 years ago Closed 8 years ago

Intermittent-infra DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None'

Categories

(Testing :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(firefox40 affected)

RESOLVED DUPLICATE of bug 1280570
Tracking Status
firefox40 --- affected

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

+++ This bug was initially created as a clone of Bug #1156422 +++ Despite efforts in bug 1156422, this problem persists.
http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-api-11/1429826763/mozilla-inbound_ubuntu64_vm_armv7_mobile_test-mochitest-7-bm122-tests1-linux64-build0.txt.gz 15:38:07 INFO - getLogcat threw DMError; re-trying just once... 15:38:07 INFO - Traceback (most recent call last): 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/runtests.py", line 2266, in doTests 15:38:07 INFO - quiet=options.quiet 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/runtestsremote.py", line 461, in runApp 15:38:07 INFO - return self._automation.runApp(*args, **kwargs) 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/automation.py", line 844, in runApp 15:38:07 INFO - crashed = self.checkForCrashes(os.path.join(profileDir, "minidumps"), symbolsPath) 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/remoteautomation.py", line 188, in checkForCrashes 15:38:07 INFO - logcat = self._devicemanager.getLogcat(filterOutRegexps=fennecLogcatFilters) 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/devicemanager.py", line 159, in getLogcat 15:38:07 INFO - root=self._logcatNeedsRoot).split('\r') 15:38:07 INFO - File "/builds/slave/test/build/tests/mochitest/devicemanager.py", line 397, in shellCheckOutput 15:38:07 INFO - raise DMError("Non-zero return code for command: %s (output: '%s', retval: '%s')" % (cmd, output, retval)) 15:38:07 INFO - DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None') 15:38:07 INFO - 130 ERROR Automation Error: Received unexpected exception while running application
(In reply to Geoff Brown [:gbrown] from comment #1) Complete logcat ends in: 04-23 15:36:40.704 I/GeckoDump( 790): ⰲ겿{"action":"test_status","time":1429828600703,"thread":"","js_source":"TestRunner","pid":null,"source":"mochitest","test":"/tests/dom/html/test/test_bug879319.html","subtest":"Form.tmp0 == input0","status":"PASS"}ⰲ겿 04-23 15:36:40.834 I/GeckoDump( 790): ⰲ겿{"action":"test_status","time":1429828600723,"thread":"","js_source":"TestRunner","pid":null,"source":"mochitest","test":"/tests/dom/html/test/test_bug879319.html","subtest":"Form.foo0 is still here","status":"PASS"}ⰲ겿 04-23 15:36:40.877 I/GeckoDump( 790): ⰲ겿{"action":"test_status","time":1429828600870,"thread":"","js_source":"TestRunner","pid":null,"source":"mochitest","test":"/tests/dom/html/test/test_bug879319.html","subtest":"tmp1 is in form.elements","status":"PASS"}ⰲ겿
Summary: Android 4.3 DMError: Non-zero return code for command: ['/system/bin/logcat' → DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None')
I don't see any sign of trouble in the emulator log, but the emulator appears to be non-responsive. 15:38:07 INFO - ##### 15:38:07 INFO - ##### Running stop-emulators step. 15:38:07 INFO - ##### 15:38:07 INFO - Running pre-action listener: _resource_record_pre_action 15:38:07 INFO - Running main action method: stop_emulators 15:38:07 INFO - Checking emulator test-1 15:38:07 INFO - Attempt #1 to connect to emulator on port 5554 15:38:07 INFO - Trying again after socket error: [Errno 111] Connection refused 15:38:07 INFO - Sleeping 30 seconds 15:38:37 INFO - Attempt #2 to connect to emulator on port 5554 15:38:37 INFO - Trying again after socket error: [Errno 111] Connection refused 15:38:37 INFO - Sleeping 30 seconds 15:39:07 INFO - Attempt #3 to connect to emulator on port 5554 15:39:07 INFO - Trying again after socket error: [Errno 111] Connection refused 15:39:07 INFO - Sleeping 30 seconds 15:39:37 INFO - Attempt #4 to connect to emulator on port 5554 15:39:37 INFO - Trying again after socket error: [Errno 111] Connection refused 15:39:37 WARNING - Unable to communicate with emulator on port 5554 error: device not found 15:39:37 INFO - ['/builds/slave/test/build/android-sdk18/platform-tools/adb', '-s', 'emulator-5554', 'shell', 'ps']: 15:39:37 INFO - 15:39:37 INFO - None 15:39:37 INFO - Let's kill every process called emulator64-arm 15:39:37 INFO - Killing pid 3058.
Bug 1123828 (and bug 1164866) made subtle changes to the failure diagnostics seen here, beginning 2015-05-22 on m-c. They may also reduce frequency of emulator problems, but probably not significantly. So far, it looks like adb does not recognize the emulator, although the emulator process is still running: 13:42:24 INFO - ##### 13:42:24 INFO - ##### Running stop-emulator step. 13:42:24 INFO - ##### 13:42:24 INFO - Running pre-action listener: _resource_record_pre_action 13:42:24 INFO - Running main action method: stop_emulator 13:42:24 INFO - Verifying adb connectivity 13:42:24 INFO - Running timeout 300 /builds/slave/test/build/android-sdk18/platform-tools/adb wait-for-device 13:47:24 INFO - Running timeout 300 /builds/slave/test/build/android-sdk18/platform-tools/adb devices 13:47:24 INFO - List of devices attached 13:47:24 WARNING - Unable to communicate with emulator via adb 13:47:24 INFO - Let's kill every process called emulator64-arm 13:47:24 INFO - Killing pid 1641.
Depends on: 1170227
Bug 1173120 (and related) recently added timeouts for all adb commands. Failures previously reported in bug 1051908 (test harness hangs, "2400 seconds without output" reported by buildbot) are now reported here, showing multiple adb timeouts. For example, comment 593, http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-api-11/1434222221/mozilla-inbound_ubuntu64_vm_armv7_mobile_test-mochitest-3-bm52-tests1-linux64-build84.txt.gz: 13:11:49 INFO - 242 INFO TEST-OK | dom/base/test/test_file_from_blob.html | took 3023ms 13:11:49 INFO - 243 INFO TEST-START | dom/base/test/test_file_negative_date.html 13:17:32 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell dumpsys window windows; echo $?' 13:22:32 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /sdcard/tests/logs/mochitest.log' 13:27:32 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell dumpsys window windows; echo $?' 13:27:32 INFO - INFO | automation.py | Application ran for: 0:36:37.497010 13:27:32 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmppaw6RPpidlog 13:32:32 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /data/anr/traces.txt' 13:32:32 INFO - /data/anr/traces.txt not found 13:37:32 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /data/tombstones/' 13:42:32 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell chmod 777 /data/tombstones; echo $?' 13:47:33 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell rm -r /data/tombstones; echo $?' 13:52:33 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell /system/bin/logcat -v time -d dalvikvm:I ConnectivityService:S WifiMonitor:S WifiStateTracker:S wpa_supplicant:S NetworkStateTracker:S; echo $?' 13:52:33 INFO - getLogcat threw DMError; re-trying just once... 13:57:34 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell /system/bin/logcat -v time -d dalvikvm:I ConnectivityService:S WifiMonitor:S WifiStateTracker:S wpa_supplicant:S NetworkStateTracker:S; echo $?' 13:57:34 INFO - Traceback (most recent call last): 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/runtests.py", line 2296, in doTests 13:57:34 INFO - quiet=options.quiet 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/runtestsremote.py", line 464, in runApp 13:57:34 INFO - return self._automation.runApp(*args, **kwargs) 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/automation.py", line 844, in runApp 13:57:34 INFO - crashed = self.checkForCrashes(os.path.join(profileDir, "minidumps"), symbolsPath) 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/remoteautomation.py", line 200, in checkForCrashes 13:57:34 INFO - logcat = self._devicemanager.getLogcat(filterOutRegexps=fennecLogcatFilters) 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/devicemanager.py", line 160, in getLogcat 13:57:34 INFO - root=self._logcatNeedsRoot) 13:57:34 INFO - File "/builds/slave/test/build/tests/mochitest/devicemanager.py", line 399, in shellCheckOutput 13:57:34 INFO - raise DMError("Non-zero return code for command: %s (output: '%s', retval: '%s')" % (cmd, output, retval)) 13:57:34 INFO - DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None') 13:57:34 INFO - 244 ERROR Automation Error: Received unexpected exception while running application 13:57:34 INFO - Stopping web server 13:58:37 INFO - Stopping web socket server 13:58:37 INFO - Stopping ssltunnel 13:58:37 INFO - WARNING | leakcheck | refcount logging is off, so leaks can't be detected! 13:58:37 INFO - runtests.py | Running tests: end. 14:03:37 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /sdcard/tests/logs/mochitest.log' 14:03:37 INFO - Unable to retrieve log file (/sdcard/tests/logs/mochitest.log) from remote device 14:08:37 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /sdcard/tests/profile/' 14:13:37 INFO - mozdevice ERROR | Timeout exceeded for _checkCmd call 'adb shell rm -r /sdcard/tests/profile' 14:18:37 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /sdcard/tests/chrome/' 14:23:37 INFO - mozdevice ERROR | Timeout exceeded for _checkCmd call 'adb shell rm -r /sdcard/tests/chrome' 14:28:38 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb shell ls -a /sdcard/tests/robotium.config' 14:33:38 INFO - mozdevice ERROR | Timeout exceeded for _runCmd call 'adb pull /sdcard/tests/nspr /builds/slave/test/build/blobber_upload_dir' 14:38:38 INFO - mozdevice ERROR | Timeout exceeded for shell call 'adb shell /system/bin/logcat -v time -d dalvikvm:I ConnectivityService:S WifiMonitor:S WifiStateTracker:S wpa_supplicant:S NetworkStateTracker:S; echo $?' 14:38:38 INFO - Error getting device information The old failure mode, with errors but no timeouts, continues (with higher frequency than the adb timeouts).
Depends on: 1175733
This still happens but, following bug 1175733, now automatically retries. Bug 1170227 tracks investigation into the root cause.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
These are being reported again, for taskcluster only, because of bug 1280570.
Depends on: 1280570
Greg, can we get this fixed in taskcluster since its frequent :(
Flags: needinfo?(garndt)
Ouch, this definitely is frequent and seems to be have been going on fro awhile. I'm not familiar with these tasks, or the error, so I'm not sure where to even begin looking into it. Is there anyone that can provide some info on this bug of how to fix this from a taskcluster perspective? On a different note, after tomorrow, I'm traveling on Friday and on PTO for 2 weeks so if this is definitely something that could be fixed from a taskcluster perspective, I would ask Selena if she can delegate this to someone (like Wander).
Flags: needinfo?(garndt)
Greg, Selena -- This error has plagued Android tests on emulators for a long time, and we have spent a lot of time investigating the root cause, but to no avail. In brief, the emulator "randomly" dies during a test job and can no longer be contacted. Ultimately, we decided it was best to treat this as an infrastructure error: It has nothing to do with the build under test or the test itself, so just mark the job as a retry (blue) in treeherder and run it again. On buildbot, automatic retry has been a satisfactory solution for over a year (comment 696). On taskcluster, afaik, there is no way for a job to automatically retry, so this is reported as an error (orange), requiring sheriff intervention, developer confusion, and annoyance. As Android Debug tests are now tier 1, this is particularly important. There is discussion in bug 1280570 about automatic retry and I can modify mozharness to report/request the retry, but I do not know what is needed on the taskcluster side -- https://bugzilla.mozilla.org/show_bug.cgi?id=1280570#c3.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(garndt)
Ok, sounds like the issue of emulator stopping might not have been a taskcluster-only issue, however, it is a taskcluster issue that one cannot retry these tasks automatically. Unfortunately we do not have it built into taskcluster currently to retry failed tasks. (that feature has been deprecated as we deprecated our scheduler) Would it be possible to update these taskcluster tasks to have a loop to retry the tests within the same task execution rather than relying on the whole task being retried...or does it require tearing down and restarting the docker container?
Flags: needinfo?(garndt)
I *think* mozharness could detect the error, stop the test harness, kill the emulator, restart the emulator, and then restart the test harness. However, I think it would be difficult to suppress error messages (important for log viewing), and even if that is possible, the resulting log would still be confusing, with some tests run twice. Automatic retry has been a useful feature of buildbot/mozharness/treeherder; I think taskcluster needs to support it too. Does bug 1284911 - adding new jobs to try runs via treeherder - make this easier? If treeherder can initiate a task, why can't the task itself?
From my understanding of bug 1284911, that feature is to be used by a user to trigger jobs that were not originally triggered on a push, not for retrying failed tasks. We already have a retrigger button within treeherder for manually retriggering failed tasks. However, at the high rate this happens that's not going to be a valid option. Some options that come to mind: 1. task should be responsible for retrying whatever it needs to to get to a green state (such as tearing down the test harness, emulator, etc and restarting). This of course has the issue of logs being much messier and longer because of multiple runs and errors. Some tests would be run twice even if this task was retried as a whole new task, it's just that the logs would be separate between them. 2. Perhaps a worker feature is to specify that the task should be retried within the same worker if it fails up to N times. In the case of our docker based workers this would just be retrying the command in a new docker container. This would be a longer term feature request.
Just a couple of clarifications, in case they help: - I don't think we want to retry on failure in general -- if so, that's a mostly separate issue -- but on specific failures that are known to be unrelated to the build or test. There are mozharness exit status codes for success/failure/exception/retry and corresponding treeherder job colors for those conditions. I'm looking for taskcluster to retry when a mozharness script exits with "retry"; current taskcluster "failure" handling is fine as far as I am concerned. - re: "high rate" -- This DMError happens about 100 times a day, but that is still "only" about 3% of Android jobs.
Moving needinfo to 1280570 since this bug is closed.
Flags: needinfo?(sdeckelmann)
please see bug 1280570 for information about this.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Assignee: gbrown → nobody
See Also: → 1293309
Depends on: 1296437
For those keeping score at home: this is the number one failure, not quite twice as frequent as the number two failure, but more than two and three put together.
Summary: DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None') → Intermittent-infra DMError: Non-zero return code for command: ['/system/bin/logcat', '-v', 'time', '-d', 'dalvikvm:I', 'ConnectivityService:S', 'WifiMonitor:S', 'WifiStateTracker:S', 'wpa_supplicant:S', 'NetworkStateTracker:S'] (output: '', retval: 'None'
Assignee: nobody → gbrown
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.