Closed Bug 700722 Opened 13 years ago Closed 12 years ago

Talos process checking is over-ambitious and wrong

Categories

(Testing :: Talos, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: k0scist, Assigned: k0scist)

References

Details

(Whiteboard: [mozbase][mozharness+talos])

Attachments

(4 files, 2 obsolete files)

Talos checks for processes based on the name through a grep of ps ax
on linux.  Unfortunately, it does not particularly descriminate about
what the process actually is (see ffprocess_linux.GetPidsByName).
This results in talos being hard to consume if you need to spawn it as
a subprocess.  For instance, for bug 650887, you will have to pass (e.g.)

--appname /home/jhammel/firefox/firefox

to the script which will in turn call `run_tests.py` which will err
out in ttest.py based on this check.  Talos needs to be more surgical
about this.

To reproduce, run something like `emacs -nw firefox.txt`, close your
browser, then run what would be an otherwise successful run_tests.py
invocation.  You will hit this error.

In addition, the PIDs of the processes should be printed, not just the
process name sought, for easier diagnostics.
Blocks: 650887
I would advocate that the longer term fix is moving to mozrunner ( https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check for processes optional and off by default. Really, the process name checking should be made slightly better, and possibly only kill processes based on the actual process name instead of arguments.  Automation should certainly be aggressive about killing processes, but I don't feel that talos should serve as much more than an advisor here.

In the short term we should probably either make the process check optional and *on* by default and maybe have a way to pass in PIDs to exclude.

Really, although this is a talos bug, its a bug about how to do process management in general at mozilla and we should move towards best practices there, including having robust routines to kill/query by process.

See also: https://bugzilla.mozilla.org/show_bug.cgi?id=454999
Whiteboard: [mozbase]
If you have any suggestions on APIs for mozrunner to make this easy, I'd be very interested.(In reply to Jeff Hammel [:jhammel] from comment #1)
> I would advocate that the longer term fix is moving to mozrunner (
> https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check
> for processes optional and off by default. Really, the process name checking
> should be made slightly better, and possibly only kill processes based on
> the actual process name instead of arguments.  Automation should certainly
> be aggressive about killing processes, but I don't feel that talos should
> serve as much more than an advisor here.

I agree, though I'm not 100% sure what the best API for this would be. Should the MozRunner interface have a static methods to see if the process is running? Note that's about all we can do on Fennec for Android anyway.
I think putting this in MozRunner is probably a good idea. Certainly a lot of it should go in mozprocess which mozrunner front-ends.
So this is a more robust way of identify processes based on name.  This is sample code (NOT final form!) that currently is only tested on linux, but I'm thinking of adding something like this to Talos and to mozprocess with, per :wlach's recommendation, a front-end in mozrunner.
mozbase issue ticketed: https://bugzilla.mozilla.org/show_bug.cgi?id=700862

I'll work on cleaning up how Talos handles this, but mozbase should also be augmented and eventually much of Talos will be replaced by mozrunner/mozprocess and we can then eliminate the duplicate code.
See Also: → 700862
Assignee: nobody → jhammel
Attached patch WIP (obsolete) — Splinter Review
This doesn't fix ffprocess_remote.py yet, but I think the general idea is there.  Note that if we go this route we may miss legitimate positives that right now being overly-aggressive catch.  We can probably catch these by extending the list in FFProcess.checkAllProcesses, but I'm not sure what to add, and I'm not sure if its a burning need right now.
Attachment #573078 - Flags: feedback?(jmaher)
Comment on attachment 573078 [details] [diff] [review]
WIP

Review of attachment 573078 [details] [diff] [review]:
-----------------------------------------------------------------

I like this.  We just need to ensure the mac zombie process detection is thought out.  Also does mozprocess support this?  If not, we are adding more stuff that makes it harder to switch to mozprocess, but I would argue we just need to get this in mozprocess if it doesn't.
Attachment #573078 - Flags: feedback?(jmaher) → feedback+
I think I figured out the solution to mac zombie detection: http://k0s.org/mozilla/hg/talos-patches/rev/008d05534d76#l1.62 .  However, this is from net wisdom and I have not tried in on a mac...because I don't have one.  mozprocess does not have a good version of this functionality: see bug 700862 for getting it added there
Attached patch go after processes in a more (obsolete) — Splinter Review
In theory this should fix things a bit
Attachment #573341 - Flags: review?(jmaher)
Oops forgot to take out

-# XXX unsure how to reconcile this check with the above currently
-#            #overlook zombie processes
-#            if line.find("Z+") >= 0:
-#                continue

They'll be removed before i commit, if approved
Comment on attachment 573341 [details] [diff] [review]
go after processes in a more

Review of attachment 573341 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/utils.py
@@ +191,5 @@
> +        prog = command[0]
> +        basename = os.path.basename(prog)
> +        if basename == name:
> +            retval.append((int(process['PID']), command))
> +    return retval

nit: this is a 4 space indentation function when the other functions you are adding are 2 space indentations.  I would keep the indentation level to that of the existing file (2 space), or change the entire file to 4 space as most of talos is 4 space.
Attachment #573341 - Flags: review?(jmaher) → review+
Comment on attachment 573341 [details] [diff] [review]
go after processes in a more

Overall I really like this patch. 

+++ b/talos/devicemanagerADB.py
@@ -203,17 +203,17 @@ class DeviceManagerADB(DeviceManager):
           if (data[0].find("No such file or directory") != -1):
               return []
           if (data[0].find("Not a directory") != -1):
               return []
       return data
 
   # external function
   # returns:
-  #  success: array of process tuples
+  #  success: a list of [ [pid, appname, userid], ...]

IMO we should file a seperate bug for this to do the following:

1. Add this documentation to the same message in devicemanager.py and devicemanagerSUT.py
2. Remove the superfluous call to convert the triplet to a list
3. Mirror the changes to these files in m-c
2.
Attachment #573341 - Flags: feedback+
(In reply to Joel Maher (:jmaher) from comment #11)
> Comment on attachment 573341 [details] [diff] [review] [diff] [details] [review]
> go after processes in a more
> 
> Review of attachment 573341 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> ::: talos/utils.py
> @@ +191,5 @@
> > +        prog = command[0]
> > +        basename = os.path.basename(prog)
> > +        if basename == name:
> > +            retval.append((int(process['PID']), command))
> > +    return retval
> 
> nit: this is a 4 space indentation function when the other functions you are
> adding are 2 space indentations.  I would keep the indentation level to that
> of the existing file (2 space), or change the entire file to 4 space as most
> of talos is 4 space.

Pushed: http://hg.mozilla.org/build/talos/rev/6e5f5cadd9e9

Oops, meant to correct that.  Its fixed in the push.

I also forgot to fix removing the obselete comment, so I landed a follow-up: http://hg.mozilla.org/build/talos/rev/c7c8935034a4
(In reply to William Lachance (:wlach) from comment #12)
<snip/>
> 1. Add this documentation to the same message in devicemanager.py and
> devicemanagerSUT.py
> 2. Remove the superfluous call to convert the triplet to a list
> 3. Mirror the changes to these files in m-c

I've done 1 and 2 and have filed bug 701420 as a follow-up for 3.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Jeff Hammel [:jhammel] from comment #8)
> I think I figured out the solution to mac zombie detection:
> http://k0s.org/mozilla/hg/talos-patches/rev/008d05534d76#l1.62 .  However,
> this is from net wisdom and I have not tried in on a mac...because I don't
> have one.  mozprocess does not have a good version of this functionality:
> see bug 700862 for getting it added there

You have access to VNC into this mac box right now and use it for anything you want.
10.250.5.254.  Let me know if you need passwords.
Backed out due to https://bugzilla.mozilla.org/show_bug.cgi?id=702351#c6

More than likely we were doing something wrong before that our overzealous process checking was shielding us from.  For instance, if we grepped for '*' we would match all PIDs and this wouldn't have happen.  That doesn't make it right.

I'm not sure what the path forward is here

http://hg.mozilla.org/build/talos/rev/fb920d618a9f
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Running on staging this at least fails for leopard and snowleopard:

Traceback (most recent call last):
  File "run_tests.py", line 596, in <module>
    main()
  File "run_tests.py", line 593, in main
    test_file(arg, options.screen, options.amo)
  File "run_tests.py", line 521, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/Users/cltbld/talos-slave/talos-data/talos/ttest.py", line 309, in runTest
    cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters)
  File "/Users/cltbld/talos-slave/talos-data/talos/cmanager_mac.py", line 105, in __init__
    self.pid = self.ffprocess.GetPidsByName(process)[-1]
IndexError: list index out of range

I am guessing that we are looking for the wrong process name, e.g. something like firefox-bin vs firefox. But I'm not sure.  When I have time I will stage this with a few more diagnostics but am glad to see it is reproducible.
Blocks: 713055
Priority: -- → P1
Whiteboard: [mozbase] → [mozbase][mozharness+talos]
See Also: → 454999
Learned a bit more about this type of issue when working on my mozrunner abstraction.

I think we can fix this on Mac and Linux by using "ps -eo pid,comm" instead of "ps ax". That will give only the command line (e.g. /path/to/firefox) without parameters.
This will actually still suffer from the failure in comment 17 and other baddities.  I would prefer definitively parsing and fixing the failure case on mac.
Attached patch unbitrotSplinter Review
will test on try. this just unbitrots, doesn't fix any of the mystery problems
Attachment #573078 - Attachment is obsolete: true
Attachment #573341 - Attachment is obsolete: true
Attachment #600516 - Flags: review?(jmaher)
One bad: 

Traceback (most recent call last):
  File "run_tests.py", line 661, in <module>
    main()
  File "run_tests.py", line 658, in main
    test_file(arg, options, parser.parsed)
  File "run_tests.py", line 580, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest
    cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters)
  File "/home/cltbld/talos-slave/talos-data/talos/cmanager_linux.py", line 149, in __init__
    self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
Comment on attachment 600516 [details] [diff] [review]
unbitrot

will wait to see if try turns up any other errors and then add some debugging code
Attachment #600516 - Flags: review?(jmaher) → review-
Try run for 350e069eea13 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=350e069eea13
Results (out of 72 total builds):
    success: 62
    failure: 10
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jhammel@mozilla.com-350e069eea13
tp is failing on linux and osx...works great on windows!  

from linux:
WARNING: failed to os.chmod(/tmp/tmppBe7Ki/profile/lock): 2 : No such file or directory
Traceback (most recent call last):
  File "run_tests.py", line 661, in <module>
    main()
  File "run_tests.py", line 658, in main
    test_file(arg, options, parser.parsed)
  File "run_tests.py", line 580, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest
    cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters)
  File "/home/cltbld/talos-slave/talos-data/talos/cmanager_linux.py", line 149, in __init__
    self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
IndexError: list index out of range


from osx:
Traceback (most recent call last):
  File "run_tests.py", line 661, in <module>
    main()
  File "run_tests.py", line 658, in main
    test_file(arg, options, parser.parsed)
  File "run_tests.py", line 580, in test_file
    browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test)
  File "/Users/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest
    cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters)
  File "/Users/cltbld/talos-slave/talos-data/talos/cmanager_mac.py", line 106, in __init__
    self.pid = self.ffprocess.GetPidsByName(process)[-1]
IndexError: list index out of range
Yep, confirmed.  This is the only error (visible now, anyway) and it occurs, strangely, not just based on platform but on test suite.  One of the Ts and tp4 on OSX and linux (both 64bit and 32bit) fail with this message.  

The problem seems to be:

python PerfConfigurator.py -v -e ../firefox/firefox-bin -t talos-r3-fed-066 --branchName Try --resultsServer graphs-old.mozilla.org --resultsLink /server/collect.cgi --activeTests tp5r --mozAfterPaint --responsiveness --ignoreFirst --sampleConfig sample.2.config --symbolsPath ../symbols

But in fact we are checking for process='firefox' here
hmmm, possibly a red-herring?  Successful runs also use -e path/to/firefox-bin
ah, but...! these tests are the only ones with counters hence the only ones that hit this code path. So comment 25 is correct.  The question is what to do about it.  What is 'process' supposed to mean here?
I can repro locally with this yml file
So the real question is, do we need browser_config['process'] at all?  If we don't, we can get the name from browser_path.  If we do, then we should figure out what we want to do there...but it probably won't be easy.  Buildbot will have to be changed to pass in process of some appropriate value which must be simultaneously deployed with talos changes
Attachment #601008 - Flags: review?(jmaher)
looks pretty green on try: https://tbpl.mozilla.org/?tree=Try&rev=38853f302ad0

Still needs android testing
Whiteboard: [mozbase][mozharness+talos] → [mozbase][mozharness+talos][talos-checkin-needed]
Try run for 38853f302ad0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=38853f302ad0
Results (out of 74 total builds):
    success: 73
    failure: 1
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jhammel@mozilla.com-38853f302ad0
Comment on attachment 601008 [details] [diff] [review]
set process name based on browser_path

Review of attachment 601008 [details] [diff] [review]:
-----------------------------------------------------------------

looking good, thanks!
Attachment #601008 - Flags: review?(jmaher) → review+
pushed: http://hg.mozilla.org/build/talos/rev/099171bc6da5
Status: REOPENED → RESOLVED
Closed: 13 years ago12 years ago
Resolution: --- → FIXED
Whiteboard: [mozbase][mozharness+talos][talos-checkin-needed] → [mozbase][mozharness+talos]
Blocks: 865311
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: