Closed Bug 875814 Opened 8 years ago Closed 7 years ago

Hung/long-running android xpcshell tests fail with "AttributeError: 'NoneType' object has no attribute 'pid'"

Categories

(Testing :: XPCShell Harness, defect)

x86
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla30

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Attachments

(1 file, 1 obsolete file)

We probably need to set proc.pid or change runxpcshelltests.py to not reference it, at least for remote tests.


TEST-INFO | /builds/tegra-092/test/build/tests/xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/test_backgroundupdate.js | running test ...
Exception in thread Thread-294:
Traceback (most recent call last):
  File "/tools/python27/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/tools/python27/lib/python2.7/threading.py", line 755, in run
    self.function(*self.args, **self.kwargs)
  File "/builds/tegra-092/test/build/tests/xpcshell/runxpcshelltests.py", line 895, in <lambda>
    testTimer = Timer(HARNESS_TIMEOUT, lambda: self.testTimeout(name, proc.pid))
AttributeError: 'NoneType' object has no attribute 'pid'

Traceback (most recent call last):
  File "xpcshell/remotexpcshelltests.py", line 503, in <module>
    main()
  File "xpcshell/remotexpcshelltests.py", line 498, in main
    **options.__dict__):
  File "/builds/tegra-092/test/build/tests/xpcshell/runxpcshelltests.py", line 800, in runTests
    pStderr=pStderr, keep_going=keepGoing)
  File "/builds/tegra-092/test/build/tests/xpcshell/runxpcshelltests.py", line 907, in run_test
    stdout=pStdout, stderr=pStderr, env=self.env, cwd=test_dir)
  File "xpcshell/remotexpcshelltests.py", line 309, in launchProcess
    self.shellReturnCode = self.device.shell(cmd, f)
  File "/builds/tegra-092/test/build/tests/xpcshell/devicemanagerSUT.py", line 339, in shell
    raise DMError("Automation Error: Error finding end of line/return value when running '%s'" % cmdline)
devicemanager.DMError: Automation Error: Error finding end of line/return value when running '/data/local/xpcb/xpcw /mnt/sdcard/tests/xpcshell/toolkit/mozapps/extensions/test/xpcshell -r /mnt/sdcard/tests/xpcshell/c/httpd.manifest --greomni /data/local/xpcb/fennec-24.0a1.en-US.android-arm.apk -m -n -s -e 'const _HTTPD_JS_PATH = "/mnt/sdcard/tests/xpcshell/c/httpd.js";' -e 'const _HEAD_JS_PATH = "/mnt/sdcard/tests/xpcshell/head.js";' -e 'const _TESTING_MODULES_DIR = "/mnt/sdcard/tests/xpcshell/m";' -f /mnt/sdcard/tests/xpcshell/head.js -e 'const _SERVER_ADDR = "localhost"' -e 'const _HEAD_FILES = ["/mnt/sdcard/tests/xpcshell/toolkit/mozapps/extensions/test/xpcshell/head_addons.js"];' -e 'const _TAIL_FILES = [];' -e 'const _TEST_FILE = ["test_backgroundupdate.js"];' -e '_execute_test(); quit(0);''
program finished with exit code 1
Already reported for b2g - bug 873053.
See Also: → 873053
See Also: → 890422
See Also: → 892711
This is promising in that it allows us to avoid the proc.pid failure on android without affecting the desktop timeout handling. However, it exposes more problems for android timeout handling: The harness timeout is the same as the device manager timeout; relying on the device manager timeout leaves the process executing; killing the process from the harness timeout handling results in a device manager exception when launchProcess tries to parse the return code from sut -- resulting in infinite retries, just as bad as the current situation.
I assume you just want to do something like we do with Mochitest, where we have two layers of timeouts:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#589

The JS harness has its own internal timeout, and the Python harness has that + 30s to fall back to.

If device manager times out, can we detect that and then kill the process from Python?
Blocks: 873053
See Also: 873053
Blocks: 967704
Assignee: nobody → gbrown
It was good to step away from this for a bit -- when I came back, I saw how I was confusing myself.

Once the mechanics of passing proc vs proc.PID were worked out, hung xpcshell tests continued to fail with DMErrors...but that was just because the xpcshell process never completed the test. What happens is that the dm.shell() issues a SUT 'exec' command, and SUTagent waits for xpcshell to finish for 5 minutes -- so launchProcess is still running when the Python timer expires and kills the process. devicemanager may see no return code because the xpcshell process is still running when sut or devicemanager times out or because the xpcshell process is killed from the timer handler...whatever the reason, devicemanager will raise an exception. In other circumstances, that DMError is likely appropriate, so I didn't change that. Instead, I catch the DMError and ignore it in the case of a harness timeout -- seems to work fine and delivers a nice tidy and descriptive failure message.

Here's a try run without any changes except the introduction of a hanging test:

https://tbpl.mozilla.org/?tree=Try&rev=8e2ab989bd16

Notice that Linux reports a simple "Test timed out" while Android reports "AttributeError: 'NoneType' object has no attribute 'pid'
raise DMError("Automation Error: Error finding end of line/return value when running '%s'" % cmdline)"

Here's the same test + this patch applied:

https://tbpl.mozilla.org/?tree=Try&rev=30a45d0be5ad

Now we are all on the same page: "Test timed out."
Attachment #799155 - Attachment is obsolete: true
Attachment #8372302 - Flags: review?(jmaher)
Comment on attachment 8372302 [details] [diff] [review]
fix xpcshell hang/timeout handling

Review of attachment 8372302 [details] [diff] [review]:
-----------------------------------------------------------------

I am really not sure about the self.timedout variable.  The try server run does give me proof of this working- but this patch as a standalone patch doesn't help me understand it.

::: testing/xpcshell/remotexpcshelltests.py
@@ +137,5 @@
> +                if self.timedout:
> +                    # If the test timed out, there is a good chance the SUTagent also
> +                    # timed out and failed to return a return code, generating a
> +                    # DMError. Ignore the DMError to simplify the error report.
> +                    self.shellReturnCode = None

who sets self.timedout?  is there a thread that does it?

@@ +140,5 @@
> +                    # DMError. Ignore the DMError to simplify the error report.
> +                    self.shellReturnCode = None
> +                    pass
> +                else:
> +                    raise e

one would have to assume that raising an exception inside of a with statement will properly close the file.

@@ +159,5 @@
> +            # (first) starts, so its lack of presence is a hint that
> +            # something went wrong.
> +            print "Automation Error: No crash directory (%s) found on remote device" % self.remoteMinidumpDir
> +            # Whilst no crash was found, the run should still display as a failure
> +            return True

this is cut/paste from the mochitest runner?
Attachment #8372302 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #5)
> ::: testing/xpcshell/remotexpcshelltests.py
> @@ +137,5 @@
> > +                if self.timedout:
> > +                    # If the test timed out, there is a good chance the SUTagent also
> > +                    # timed out and failed to return a return code, generating a
> > +                    # DMError. Ignore the DMError to simplify the error report.
> > +                    self.shellReturnCode = None
> 
> who sets self.timedout?  is there a thread that does it?
 
It is set here:

+    def testTimeout(self, test_file, proc):
+        self.timedout = True

testTimeout() is called from existing code -- http://hg.mozilla.org/mozilla-central/annotate/d05c721ea1b0/testing/xpcshell/runxpcshelltests.py#l611 -- in a threading.Timer object, http://docs.python.org/2/library/threading.html#timer-objects. So testTimeout is called in a separate thread, typically while devicemanager.shell() is running, waiting to hear back from sut.

> one would have to assume that raising an exception inside of a with
> statement will properly close the file.

Just so: http://effbot.org/zone/python-with-statement.htm.

> @@ +159,5 @@
> > +            # (first) starts, so its lack of presence is a hint that
> > +            # something went wrong.
> > +            print "Automation Error: No crash directory (%s) found on remote device" % self.remoteMinidumpDir
> > +            # Whilst no crash was found, the run should still display as a failure
> > +            return True
> 
> this is cut/paste from the mochitest runner?

It is derived from http://hg.mozilla.org/mozilla-central/annotate/d05c721ea1b0/build/mobile/remoteautomation.py#l145. The issue here is that when xpcshell is killed, it occasionally does not create its minidump directory. That is a good thing to report, but without this code, we call devicemanager.getDirectory() which throws a DMError, bringing the whole test run to an end.
Comment on attachment 8372302 [details] [diff] [review]
fix xpcshell hang/timeout handling

thanks for clarifying everything here.  I appreciate the context and feel comfortable with this patch.
Attachment #8372302 - Flags: review- → review+
https://hg.mozilla.org/mozilla-central/rev/e4622166d378
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Duplicate of this bug: 890422
You need to log in before you can comment on or make changes to this bug.