Closed Bug 1403501 Opened 3 years ago Closed 3 years ago

Autophone - Unit Tests broken by killing ssltunnel, xpcshell

Categories

(Testing :: Autophone, enhancement)

enhancement
Not set

Tracking

(firefox57 fixed, firefox58 fixed)

RESOLVED FIXED
mozilla58
Tracking Status
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: bc, Assigned: gbrown)

References

Details

Attachments

(1 file)

bug 1386816 started killing any instance of ssltunnel or xpcshell already running at the start of the tests with the assumption that only one instance of a test runner will be executing.

Autophone is different in that it does run multiple instances and manages by choosing ports which do not conflict between the different instances.

I noticed this testing unit tests in:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8098e330f4e7afd905dcd689b4c965ddecddf1be&selectedJob=133534799

gbrown: Is there anything you can do to help me out?
Flags: needinfo?(gbrown)
Sorry :bc!

Bug 1386816 landed almost a month ago...why did it take so long to break autophone? Or has this been an ongoing problem?

A backout of 1386816 is certainly an option: Ehsan (and others) were inconvenienced by the earlier behavior, but no automation jobs were broken.

It would be nice to make everyone happy. Two strategies come to mind:
 - detect the autophone case, and leave it alone (command line option? or don't kill when non-default port used?)
 - accurately distinguish orphan processes and only kill those; (but I don't know why the earlier implementation was not working for Ehsan)
The unit tests had become perma orange and were disabled before your patch. I just recently started trying to investigate if we could re-enable them when I noticed the problem. One issue with the way it is now, is that many times the result of having ssltunnel/xpcshell killed still shows up as green which is definitely not a good thing.

The easiest would be the command line option I think though maybe figuring out why Ehsan was having problems might be the better course.
Flags: needinfo?(gbrown)
Assignee: nobody → gbrown
See Also: → 1404012
This patch reverts the strategy for server cleanup to only kill orphaned processes. That should allow autophone to run multiple servers without interference.

I'm not sure why the old strategy - before bug 1386816 - was not working for :ehsan. One possibility is that the 'ps' commands were silently failing to find the servers, or failing to get the correct associated pid and/or ppid. Rather than rely on 'ps', I am now using psutil, which should provide much better cross-platform support. I am also adding better diagnostics, in case issues persist.

:ehsan -- I hope this continues to work for you, and would appreciate getting logs from you if you have trouble in future.


https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b6140efb0c63aebbeaf47cd40f0ce21cbf1ef9b is looking good, but I haven't managed to start any autophone tests. :(

The same try syntax worked in an earlier push, so I am hoping I just need to wait.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b869c75c383d590469a6fb768fca95f9350498b1
Attachment #8913384 - Flags: review?(bob)
Attachment #8913384 - Flags: feedback?(ehsan)
I don't know why your try hasn't invoked any jobs. I've has some problems with pulse this week, but I don't see a bunch of unconsumed pulse messages in the queues. I'll shutdown and restart and see if that helps. If not, I'll submit them manually. I'll take a look at the patch after we've got some results.
Looking good:

https://autophone.s3.amazonaws.com/v1/task/NXc0dF-kTNSlmVGVNius-w/runs/0/artifacts/public/build/a64206d7-8288-4840-8426-e73c9b278733-autophone.log

2017-09-28 15:47:49,789      nexus-6p-4 INFO     UnitTestJob try 20170928223057 opt api-23 android-api-16 crashtest-dom-media REFTEST INFO | Checking for ssltunnel processes...
2017-09-28 15:47:49,789      nexus-6p-4 INFO     UnitTestJob try 20170928223057 opt api-23 android-api-16 crashtest-dom-media REFTEST INFO | Checking for xpcshell processes...
2017-09-28 15:47:49,790      nexus-6p-4 INFO     UnitTestJob try 20170928223057 opt api-23 android-api-16 crashtest-dom-media REFTEST INFO | NOT killing {'username': 'bclary', 'ppid': 21663, 'pid': 21844, 'name': 'xpcshell'} (not an orphan?)
2017-09-28 15:47:49,790      nexus-6p-4 INFO     UnitTestJob try 20170928223057 opt api-23 android-api-16 crashtest-dom-media REFTEST INFO | NOT killing {'username': 'bclary', 'ppid': 22643, 'pid': 22894, 'name': 'xpcshell'} (not an orphan?)
2017-09-28 15:47:49,790      nexus-6p-4 INFO     UnitTestJob try 20170928223057 opt api-23 android-api-16 crashtest-dom-media REFTEST INFO | NOT killing {'username': 'bclary', 'ppid': 22652, 'pid': 22928, 'name': 'xpcshell'} (not an orphan?)
I also ran the unit tests on autophone-4:
https://treeherder.allizom.org/#/jobs?repo=try&revision=6bc6ceb4adce778f89a5e8d03de202a79a4c8f86
and things look really good. This will go a long way in re-enabling the unit tests.

I have a question about the availability of psutil. The mochitests check for its availability and set HAVE_PSUTIL but the reftests do not. xpcshell/runxpcshelltests.py and mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py both also include checks for psutil. Talos and the websocketprocessbridge both include psutil in their requirements.txt. I also see psutil in many other places in the tree. Can we guarantee that psutil is available always?

In the fall back in the mochitests, I see we are using ps -o pid,ppid,comm (and did before this patch). Will that work on Windows?

If we can guarantee psutil, can we just use psutil and drop the fallback code?
Flags: needinfo?(gbrown)
It seems to me that psutil is widely used and readily available -- pip install psutil. 

I was a little surprised to see the HAVE_PSUTIL and similar guards: Was that an over-abundance of caution, or is there some client of these modules that cannot install psutil? I don't know. 

I am reluctant to remove HAVE_PSUTIL from runtests.py in case there is some good reason for it...and also to some extent in case there is a significant population of Mozilla developers running mochitests currently without psutil.

For my reftest case, psutil is only required for Android -- remotereftest.py. I expect Android reftests have a much smaller population of users, and most of those people come straight to me if Android tests are busted, so I am feeling more cavalier: I don't want to introduce the HAVE_PSUTIL complication unless I'm convinced it is needed.

I recognize there is an element of the arbitrary in these decisions; I can be swayed...


ps -o fails on Windows -- at least on our Windows test machines. It dumps usage, which doesn't match the xpcshell/ssltunnel lines we are searching for...and the checks fail silently. I don't feel great about that, but it only matters for someone running mochitest on Windows, with an orphan server that's interfering with tests, who doesn't have psutil installed. I could replace the fallback code with a warning: "Install psutil if you want me to check for orphan servers", or perhaps just do that on Windows?
Flags: needinfo?(gbrown)
Comment on attachment 8913384 [details] [diff] [review]
kill only orphans, use psutil to find processes

Review of attachment 8913384 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for looking into it. I think this situation has already existed and we are not making it worse with this patch. In fact, this makes Autophone much happier than it has been in a while. We can continue the way you did in the patch and if we have the opportunity in the future to reconcile this across suites and platforms we can do it then.

r+. This looked good to me when I was looking at it previously. I don't have any other changes I think need to be made.
Attachment #8913384 - Flags: review?(bob) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7abc24e252c0
Do not kill non-orphan xpcshell/ssltunnel; r=bc
https://hg.mozilla.org/mozilla-central/rev/7abc24e252c0
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Comment on attachment 8913384 [details] [diff] [review]
kill only orphans, use psutil to find processes

Review of attachment 8913384 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks good to me.  I haven't actually tested the patch yet because I'm not 100% sure about the exact conditions which triggered the original issue, but I'll keep an eye out for future occurrences and will file more bugs with logs if they happen again.
Attachment #8913384 - Flags: feedback?(ehsan) → feedback+
You need to log in before you can comment on or make changes to this bug.