Closed Bug 833765 Opened 8 years ago Closed 8 years ago

Intermittent DMError: Automation Error: DeviceManager isdir returned null | Exception caught while running tests

Categories

(Testing :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla23

People

(Reporter: RyanVM, Assigned: gbrown)

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=19042299&tree=Mozilla-Inbound

Android 4.0 Panda mozilla-inbound opt test mochitest-8 on 2013-01-22 23:56:52 PST for push 2d5b54f00e2e
slave: panda-0580

INFO | runtests.py | Installing extension at /builds/panda-0580/test/build/tests/mochitest/extensions/roboextender@mozilla.org to /tmp/tmpuhFNpb.
INFO | runtests.py | Installing extension at /builds/panda-0580/test/build/tests/mochitest/extensions/specialpowers to /tmp/tmpuhFNpb.
INFO | runtests.py | Installing extension at /builds/panda-0580/test/build/tests/mochitest/extensions/workerbootstrap to /tmp/tmpuhFNpb.
INFO | runtests.py | Installing extension at /builds/panda-0580/test/build/tests/mochitest/extensions/worker to /tmp/tmpuhFNpb.
Traceback (most recent call last):
  File "mochitest/runtestsremote.py", line 577, in main
  File "/builds/panda-0580/test/build/tests/mochitest/runtests.py", line 705, in runTests
    manifest = self.buildProfile(options)
  File "mochitest/runtestsremote.py", line 299, in buildProfile
  File "/builds/panda-0580/test/build/tests/mochitest/devicemanagerSUT.py", line 442, in removeDir
    if self.dirExists(remoteDir):
  File "/builds/panda-0580/test/build/tests/mochitest/devicemanagerSUT.py", line 398, in dirExists
    raise DMError('Automation Error: DeviceManager isdir returned null')
DMError: Automation Error: DeviceManager isdir returned null
Traceback (most recent call last):
  File "mochitest/runtestsremote.py", line 595, in <module>
Automation Error: Exception caught while running tests
    main()
  File "mochitest/runtestsremote.py", line 581, in main
    mochitest.stopWebServer(options)
  File "mochitest/runtestsremote.py", line 292, in stopWebServer
    self.server.stop()
AttributeError: 'MochiRemote' object has no attribute 'server'
program finished with exit code 1
elapsedTime=6.779581
TinderboxPrint: mochitest-plain<br/><em class="testfail">T-FAIL</em>
Unknown Error: command finished with exit code: 1
Almost all of the recent failures occur in robocop's testSystemPages and show a reboot. Many logs have:

W/Watchdog( 1403): *** WATCHDOG KILLING SYSTEM PROCESS: null

just before the reboot.
Also (Comments 18, 21):

*** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService
I am not sure about the cause of the WATCHDOG messages, but....

Notice that in most of these logs, the first thing that goes wrong is not "isdir returned null" but "Timeout in command exec".

Here's a try push that shows the same thing happening during a special long-running test with devicemanagerSUT debug level 4: https://tbpl.mozilla.org/?tree=Try&rev=f4960773a683. Here we see that we issue a sut exec call and wait for response:

sent cmd: exec "MOZ_CRASHREPORTER=1,XPCOM_DEBUG_BREAK=stack,MOZ_HIDE_RESULTS_TABLE=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1,MOZ_PROCESS_LOG=/tmp/tmp0AILfZpidlog,XPCOM_MEM_BLOAT_LOG=/tmp/tmpDYoKnO/runtests_leaks.log" am instrument -w -e deviceroot /mnt/sdcard/tests -e class org.mozilla.fennec.tests.testLong2 org.mozilla.roboexample.test/org.mozilla.fennec.FennecInstrumentationTestRunner
recv'ing...
... (about 300 lines)
recv'ing...
recv'ing...
INFO | runtests.py | Received unexpected exception while running application
Traceback (most recent call last):
  File "/builds/panda-0559/test/build/tests/mochitest/runtests.py", line 768, in runTests
    onLaunch=onLaunch)
  File "/builds/panda-0559/test/build/tests/mochitest/automation.py", line 1207, in runApp
    stderr = subprocess.STDOUT)
  File "/builds/panda-0559/test/build/tests/mochitest/remoteautomation.py", line 175, in Process
    return self.RProcess(self._devicemanager, cmd, stdout, stderr, env, cwd)
  File "/builds/panda-0559/test/build/tests/mochitest/remoteautomation.py", line 185, in __init__
    self.proc = dm.launchProcess(cmd, stdout, cwd, env, True)
  File "/builds/panda-0559/test/build/tests/mochitest/devicemanagerSUT.py", line 519, in launchProcess
    self.fireProcess(cmdline, failIfRunning)
  File "/builds/panda-0559/test/build/tests/mochitest/devicemanagerSUT.py", line 472, in fireProcess
    self._runCmds([{ 'cmd': 'exec ' + appname }])
  File "/builds/panda-0559/test/build/tests/mochitest/devicemanagerSUT.py", line 151, in _runCmds
    self._sendCmds(cmdlist, outputfile, timeout, retryLimit=retryLimit)
  File "/builds/panda-0559/test/build/tests/mochitest/devicemanagerSUT.py", line 132, in _sendCmds
    raise err
DMError: Automation Error: Timeout in command exec 

This does not always fail this way. In particular, it usually succeeds on tegras, and often fails on pandas.

devicemanagerSUT is waiting for the $> prompt response for 300 seconds. sutagent is also waiting 300 s for program completion: https://hg.mozilla.org/mozilla-central/file/40a228f74389/build/mobile/sutagent/android/DoCommand.java#l3734

I'm thinking that increasing the devicemanagerSUT timeout by a few seconds will help us on long-running (more than 5 minutes) tests.
Assignee: nobody → gbrown
Special long-running test, before: 

https://tbpl.mozilla.org/?tree=Try&rev=f4960773a683

And with devicemanagerSUT timeout bumped up from 300 to 320 seconds:

https://tbpl.mozilla.org/?tree=Try&rev=165cb260ff89


And here we check it against the full set of tests:

https://tbpl.mozilla.org/?tree=Try&rev=995bc18baa6d

There are still panda rc failures here, but no "isdir returned null" or "Timeout in command exec" (so far, but I'll retry some more).
Why 320? Since sutagent is waiting in 10s increments, I want an extra 10s in case there is an extra increment somehow. And then I added another 10s for network delay, etc. It's more than needed, but still only a 7% increase.
Attachment #734077 - Flags: review?(jmaher)
Attachment #734077 - Flags: feedback?(wlachance)
Comment on attachment 734077 [details] [diff] [review]
increase dmSUT default timeout from 300 to 320 s

Review of attachment 734077 [details] [diff] [review]:
-----------------------------------------------------------------

this is only a robocop issue as mochitest/talos/reftest all fire the process and then poll for the process.  Robocop (due to the instrumentation interface) does not return from the fireCommand until it is completely done.  So this patch makes sense.  Could we instead do this inside of runtestsremote.py:

if robocop:
    dm.default_timeout = 320

If we could go that route I would be happier.  If it doesn't work, I am fine with this patch. r+ as the solution is valid.
Attachment #734077 - Flags: review?(jmaher) → review+
As suggested, here is a patch that limits the change to robocop tests. It seems to work fine.

I am a little concerned that there might be some way to hit the 300 s timeout during other test types, but I think either approach is acceptable.
Attachment #734702 - Flags: review?(jmaher)
Attachment #734702 - Flags: feedback?(wlachance)
Comment on attachment 734702 [details] [diff] [review]
increase dmSUT default timeout during robocop tests only

Review of attachment 734702 [details] [diff] [review]:
-----------------------------------------------------------------

I like this better, something to keep in mind for other harnesses, but I would rather hack it from the harness level than DM.
Attachment #734702 - Flags: review?(jmaher) → review+
https://hg.mozilla.org/mozilla-central/rev/98b3cdc05f18
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Comment on attachment 734077 [details] [diff] [review]
increase dmSUT default timeout from 300 to 320 s

Makes sense
Attachment #734077 - Flags: feedback?(wlachance) → feedback+
Comment on attachment 734702 [details] [diff] [review]
increase dmSUT default timeout during robocop tests only

Makes sense
Attachment #734702 - Flags: feedback?(wlachance) → feedback+
You need to log in before you can comment on or make changes to this bug.