Closed
Bug 700722
Opened 13 years ago
Closed 12 years ago
Talos process checking is over-ambitious and wrong
Categories
(Testing :: Talos, defect, P1)
Testing
Talos
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: k0scist, Assigned: k0scist)
References
Details
(Whiteboard: [mozbase][mozharness+talos])
Attachments
(4 files, 2 obsolete files)
1.18 KB,
text/plain
|
Details | |
17.75 KB,
patch
|
k0scist
:
review-
|
Details | Diff | Splinter Review |
4.04 KB,
patch
|
Details | Diff | Splinter Review | |
22.77 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
Talos checks for processes based on the name through a grep of ps ax on linux. Unfortunately, it does not particularly descriminate about what the process actually is (see ffprocess_linux.GetPidsByName). This results in talos being hard to consume if you need to spawn it as a subprocess. For instance, for bug 650887, you will have to pass (e.g.) --appname /home/jhammel/firefox/firefox to the script which will in turn call `run_tests.py` which will err out in ttest.py based on this check. Talos needs to be more surgical about this. To reproduce, run something like `emacs -nw firefox.txt`, close your browser, then run what would be an otherwise successful run_tests.py invocation. You will hit this error. In addition, the PIDs of the processes should be printed, not just the process name sought, for easier diagnostics.
Assignee | ||
Comment 1•13 years ago
|
||
I would advocate that the longer term fix is moving to mozrunner ( https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check for processes optional and off by default. Really, the process name checking should be made slightly better, and possibly only kill processes based on the actual process name instead of arguments. Automation should certainly be aggressive about killing processes, but I don't feel that talos should serve as much more than an advisor here. In the short term we should probably either make the process check optional and *on* by default and maybe have a way to pass in PIDs to exclude. Really, although this is a talos bug, its a bug about how to do process management in general at mozilla and we should move towards best practices there, including having robust routines to kill/query by process. See also: https://bugzilla.mozilla.org/show_bug.cgi?id=454999
Whiteboard: [mozbase]
Comment 2•13 years ago
|
||
If you have any suggestions on APIs for mozrunner to make this easy, I'd be very interested.(In reply to Jeff Hammel [:jhammel] from comment #1) > I would advocate that the longer term fix is moving to mozrunner ( > https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check > for processes optional and off by default. Really, the process name checking > should be made slightly better, and possibly only kill processes based on > the actual process name instead of arguments. Automation should certainly > be aggressive about killing processes, but I don't feel that talos should > serve as much more than an advisor here. I agree, though I'm not 100% sure what the best API for this would be. Should the MozRunner interface have a static methods to see if the process is running? Note that's about all we can do on Fennec for Android anyway.
Assignee | ||
Comment 3•13 years ago
|
||
I think putting this in MozRunner is probably a good idea. Certainly a lot of it should go in mozprocess which mozrunner front-ends.
Assignee | ||
Comment 4•13 years ago
|
||
So this is a more robust way of identify processes based on name. This is sample code (NOT final form!) that currently is only tested on linux, but I'm thinking of adding something like this to Talos and to mozprocess with, per :wlach's recommendation, a front-end in mozrunner.
Assignee | ||
Comment 5•13 years ago
|
||
mozbase issue ticketed: https://bugzilla.mozilla.org/show_bug.cgi?id=700862 I'll work on cleaning up how Talos handles this, but mozbase should also be augmented and eventually much of Talos will be replaced by mozrunner/mozprocess and we can then eliminate the duplicate code.
See Also: → 700862
Assignee | ||
Updated•13 years ago
|
Assignee: nobody → jhammel
Assignee | ||
Comment 6•13 years ago
|
||
This doesn't fix ffprocess_remote.py yet, but I think the general idea is there. Note that if we go this route we may miss legitimate positives that right now being overly-aggressive catch. We can probably catch these by extending the list in FFProcess.checkAllProcesses, but I'm not sure what to add, and I'm not sure if its a burning need right now.
Attachment #573078 -
Flags: feedback?(jmaher)
Comment 7•13 years ago
|
||
Comment on attachment 573078 [details] [diff] [review] WIP Review of attachment 573078 [details] [diff] [review]: ----------------------------------------------------------------- I like this. We just need to ensure the mac zombie process detection is thought out. Also does mozprocess support this? If not, we are adding more stuff that makes it harder to switch to mozprocess, but I would argue we just need to get this in mozprocess if it doesn't.
Attachment #573078 -
Flags: feedback?(jmaher) → feedback+
Assignee | ||
Comment 8•13 years ago
|
||
I think I figured out the solution to mac zombie detection: http://k0s.org/mozilla/hg/talos-patches/rev/008d05534d76#l1.62 . However, this is from net wisdom and I have not tried in on a mac...because I don't have one. mozprocess does not have a good version of this functionality: see bug 700862 for getting it added there
Assignee | ||
Comment 9•13 years ago
|
||
In theory this should fix things a bit
Attachment #573341 -
Flags: review?(jmaher)
Assignee | ||
Comment 10•13 years ago
|
||
Oops forgot to take out -# XXX unsure how to reconcile this check with the above currently -# #overlook zombie processes -# if line.find("Z+") >= 0: -# continue They'll be removed before i commit, if approved
Comment 11•13 years ago
|
||
Comment on attachment 573341 [details] [diff] [review] go after processes in a more Review of attachment 573341 [details] [diff] [review]: ----------------------------------------------------------------- ::: talos/utils.py @@ +191,5 @@ > + prog = command[0] > + basename = os.path.basename(prog) > + if basename == name: > + retval.append((int(process['PID']), command)) > + return retval nit: this is a 4 space indentation function when the other functions you are adding are 2 space indentations. I would keep the indentation level to that of the existing file (2 space), or change the entire file to 4 space as most of talos is 4 space.
Attachment #573341 -
Flags: review?(jmaher) → review+
Comment 12•13 years ago
|
||
Comment on attachment 573341 [details] [diff] [review] go after processes in a more Overall I really like this patch. +++ b/talos/devicemanagerADB.py @@ -203,17 +203,17 @@ class DeviceManagerADB(DeviceManager): if (data[0].find("No such file or directory") != -1): return [] if (data[0].find("Not a directory") != -1): return [] return data # external function # returns: - # success: array of process tuples + # success: a list of [ [pid, appname, userid], ...] IMO we should file a seperate bug for this to do the following: 1. Add this documentation to the same message in devicemanager.py and devicemanagerSUT.py 2. Remove the superfluous call to convert the triplet to a list 3. Mirror the changes to these files in m-c 2.
Attachment #573341 -
Flags: feedback+
Assignee | ||
Comment 13•13 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #11) > Comment on attachment 573341 [details] [diff] [review] [diff] [details] [review] > go after processes in a more > > Review of attachment 573341 [details] [diff] [review] [diff] [details] [review]: > ----------------------------------------------------------------- > > ::: talos/utils.py > @@ +191,5 @@ > > + prog = command[0] > > + basename = os.path.basename(prog) > > + if basename == name: > > + retval.append((int(process['PID']), command)) > > + return retval > > nit: this is a 4 space indentation function when the other functions you are > adding are 2 space indentations. I would keep the indentation level to that > of the existing file (2 space), or change the entire file to 4 space as most > of talos is 4 space. Pushed: http://hg.mozilla.org/build/talos/rev/6e5f5cadd9e9 Oops, meant to correct that. Its fixed in the push. I also forgot to fix removing the obselete comment, so I landed a follow-up: http://hg.mozilla.org/build/talos/rev/c7c8935034a4
Assignee | ||
Comment 14•13 years ago
|
||
(In reply to William Lachance (:wlach) from comment #12) <snip/> > 1. Add this documentation to the same message in devicemanager.py and > devicemanagerSUT.py > 2. Remove the superfluous call to convert the triplet to a list > 3. Mirror the changes to these files in m-c I've done 1 and 2 and have filed bug 701420 as a follow-up for 3.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 15•13 years ago
|
||
(In reply to Jeff Hammel [:jhammel] from comment #8) > I think I figured out the solution to mac zombie detection: > http://k0s.org/mozilla/hg/talos-patches/rev/008d05534d76#l1.62 . However, > this is from net wisdom and I have not tried in on a mac...because I don't > have one. mozprocess does not have a good version of this functionality: > see bug 700862 for getting it added there You have access to VNC into this mac box right now and use it for anything you want. 10.250.5.254. Let me know if you need passwords.
Assignee | ||
Comment 16•13 years ago
|
||
Backed out due to https://bugzilla.mozilla.org/show_bug.cgi?id=702351#c6 More than likely we were doing something wrong before that our overzealous process checking was shielding us from. For instance, if we grepped for '*' we would match all PIDs and this wouldn't have happen. That doesn't make it right. I'm not sure what the path forward is here http://hg.mozilla.org/build/talos/rev/fb920d618a9f
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 17•13 years ago
|
||
Running on staging this at least fails for leopard and snowleopard: Traceback (most recent call last): File "run_tests.py", line 596, in <module> main() File "run_tests.py", line 593, in main test_file(arg, options.screen, options.amo) File "run_tests.py", line 521, in test_file browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test) File "/Users/cltbld/talos-slave/talos-data/talos/ttest.py", line 309, in runTest cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters) File "/Users/cltbld/talos-slave/talos-data/talos/cmanager_mac.py", line 105, in __init__ self.pid = self.ffprocess.GetPidsByName(process)[-1] IndexError: list index out of range I am guessing that we are looking for the wrong process name, e.g. something like firefox-bin vs firefox. But I'm not sure. When I have time I will stage this with a few more diagnostics but am glad to see it is reproducible.
Assignee | ||
Updated•13 years ago
|
Priority: -- → P1
Whiteboard: [mozbase] → [mozbase][mozharness+talos]
Comment 18•12 years ago
|
||
Learned a bit more about this type of issue when working on my mozrunner abstraction. I think we can fix this on Mac and Linux by using "ps -eo pid,comm" instead of "ps ax". That will give only the command line (e.g. /path/to/firefox) without parameters.
Assignee | ||
Comment 19•12 years ago
|
||
This will actually still suffer from the failure in comment 17 and other baddities. I would prefer definitively parsing and fixing the failure case on mac.
Assignee | ||
Comment 20•12 years ago
|
||
will test on try. this just unbitrots, doesn't fix any of the mystery problems
Attachment #573078 -
Attachment is obsolete: true
Attachment #573341 -
Attachment is obsolete: true
Attachment #600516 -
Flags: review?(jmaher)
Assignee | ||
Comment 21•12 years ago
|
||
One bad: Traceback (most recent call last): File "run_tests.py", line 661, in <module> main() File "run_tests.py", line 658, in main test_file(arg, options, parser.parsed) File "run_tests.py", line 580, in test_file browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test) File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters) File "/home/cltbld/talos-slave/talos-data/talos/cmanager_linux.py", line 149, in __init__ self.primaryPid = self.ffprocess.GetPidsByName(process)[-1]
Assignee | ||
Comment 22•12 years ago
|
||
Comment on attachment 600516 [details] [diff] [review] unbitrot will wait to see if try turns up any other errors and then add some debugging code
Attachment #600516 -
Flags: review?(jmaher) → review-
Comment 23•12 years ago
|
||
Try run for 350e069eea13 is complete. Detailed breakdown of the results available here: https://tbpl.mozilla.org/?tree=Try&rev=350e069eea13 Results (out of 72 total builds): success: 62 failure: 10 Builds (or logs if builds failed) available at: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jhammel@mozilla.com-350e069eea13
Comment 24•12 years ago
|
||
tp is failing on linux and osx...works great on windows! from linux: WARNING: failed to os.chmod(/tmp/tmppBe7Ki/profile/lock): 2 : No such file or directory Traceback (most recent call last): File "run_tests.py", line 661, in <module> main() File "run_tests.py", line 658, in main test_file(arg, options, parser.parsed) File "run_tests.py", line 580, in test_file browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test) File "/home/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters) File "/home/cltbld/talos-slave/talos-data/talos/cmanager_linux.py", line 149, in __init__ self.primaryPid = self.ffprocess.GetPidsByName(process)[-1] IndexError: list index out of range from osx: Traceback (most recent call last): File "run_tests.py", line 661, in <module> main() File "run_tests.py", line 658, in main test_file(arg, options, parser.parsed) File "run_tests.py", line 580, in test_file browser_dump, counter_dump, print_format = mytest.runTest(browser_config, test) File "/Users/cltbld/talos-slave/talos-data/talos/ttest.py", line 334, in runTest cm = self.cmanager.CounterManager(self._ffprocess, browser_config['process'], counters) File "/Users/cltbld/talos-slave/talos-data/talos/cmanager_mac.py", line 106, in __init__ self.pid = self.ffprocess.GetPidsByName(process)[-1] IndexError: list index out of range
Assignee | ||
Comment 25•12 years ago
|
||
Yep, confirmed. This is the only error (visible now, anyway) and it occurs, strangely, not just based on platform but on test suite. One of the Ts and tp4 on OSX and linux (both 64bit and 32bit) fail with this message. The problem seems to be: python PerfConfigurator.py -v -e ../firefox/firefox-bin -t talos-r3-fed-066 --branchName Try --resultsServer graphs-old.mozilla.org --resultsLink /server/collect.cgi --activeTests tp5r --mozAfterPaint --responsiveness --ignoreFirst --sampleConfig sample.2.config --symbolsPath ../symbols But in fact we are checking for process='firefox' here
Assignee | ||
Comment 26•12 years ago
|
||
hmmm, possibly a red-herring? Successful runs also use -e path/to/firefox-bin
Assignee | ||
Comment 27•12 years ago
|
||
ah, but...! these tests are the only ones with counters hence the only ones that hit this code path. So comment 25 is correct. The question is what to do about it. What is 'process' supposed to mean here?
Assignee | ||
Comment 28•12 years ago
|
||
I can repro locally with this yml file
Assignee | ||
Comment 29•12 years ago
|
||
So the real question is, do we need browser_config['process'] at all? If we don't, we can get the name from browser_path. If we do, then we should figure out what we want to do there...but it probably won't be easy. Buildbot will have to be changed to pass in process of some appropriate value which must be simultaneously deployed with talos changes
Assignee | ||
Comment 30•12 years ago
|
||
Attachment #601008 -
Flags: review?(jmaher)
Assignee | ||
Comment 31•12 years ago
|
||
looks pretty green on try: https://tbpl.mozilla.org/?tree=Try&rev=38853f302ad0 Still needs android testing
Whiteboard: [mozbase][mozharness+talos] → [mozbase][mozharness+talos][talos-checkin-needed]
Comment 32•12 years ago
|
||
Try run for 38853f302ad0 is complete. Detailed breakdown of the results available here: https://tbpl.mozilla.org/?tree=Try&rev=38853f302ad0 Results (out of 74 total builds): success: 73 failure: 1 Builds (or logs if builds failed) available at: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jhammel@mozilla.com-38853f302ad0
Comment 33•12 years ago
|
||
Comment on attachment 601008 [details] [diff] [review] set process name based on browser_path Review of attachment 601008 [details] [diff] [review]: ----------------------------------------------------------------- looking good, thanks!
Attachment #601008 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 34•12 years ago
|
||
pushed: http://hg.mozilla.org/build/talos/rev/099171bc6da5
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Whiteboard: [mozbase][mozharness+talos][talos-checkin-needed] → [mozbase][mozharness+talos]
You need to log in
before you can comment on or make changes to this bug.
Description
•