Closed Bug 923770 Opened 11 years ago Closed 11 years ago

Firefox doesn't close properly in some cases, nor does bcontroller, causing timout exceeded messages

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(2 files, 2 obsolete files)

My favorite bug 798219 has moved up in priority.  In fact, this bug is next to impossible to reproduce locally.  By doing try server debugging, I found that we were going through our code to quit, but we were not actually terminating the browser.

While this is a problem in and of itself, there is yet another problem: after finding the end of the test (via log file parsing, not process termination), sys.exit() for bcontroller failed to terminate the python process.  We ended up with a timeout in ttest.py (the launch of bcontroller.py.)

So given two problems, they must be related somehow.  I have a solution which appears to work around this problem although I am not understanding the problem fully.  It is probably true that there is some correlation between the inability to terminate firefox and bcontroller.py.  I suspect a bug in python or the operating system.
I am open to other feedback if there is a better way to solve this bug.  Really this is a first pass at solving the problem- possibly an OK way to do it.

This does 2 things:
1) if we detect no activity in the log file (already in place), talos will check for our __startTimestamp*__endTimestamp message that all tests put at the end of the log file.  If it is found, we terminate nicely
2) use os._exit() instead of sys.exit()
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #813833 - Flags: review?(jhammel)
for the record, this doesn't solve all the timeouts, but it does greatly reduce them.  It appears we still fail to complete the test in many instances and then timeout.
Comment on attachment 813833 [details] [diff] [review]
force hung browser/bcontroller.py to quit (1.0)

Review of attachment 813833 [details] [diff] [review]:
-----------------------------------------------------------------

Outside of nits, looks fine.  Should think of moving to mozrunner/mozprocess at some point of time.

::: talos/bcontroller.py
@@ +163,5 @@
>    def run(self):
>      self.bwaiter = BrowserWaiter(self.remoteProcess, **self.options)
>      noise = 0
>      prev_size = 0
> +    endre = re.compile('.*__startTimestamp.*__endTimestamp.*', re.MULTILINE|re.DOTALL)

This may want to live outside of method scope to avoid re compiling multiple times.  Bonus for making this a utility.  FWIW, regexs may be an overkill since e.g. str.index may be used to find these stamps.

@@ +193,5 @@
> +        if os.path.isfile(self.browser_log):
> +            with open(self.browser_log, 'r') as bfile:
> +                bdata = bfile.read()
> +            if endre.match(bdata):
> +                noise = -1

Why -1?

@@ +200,3 @@
>  
>      results_file = open(self.browser_log, "a")
> +    if self.bwaiter.getReturn() != 0 and noise >= 0:  #the browser shutdown, but not cleanly

no need for `!= 0`

@@ +235,4 @@
>          options.get('browser_wait', -1) <= 0 or \
>          len(options.get('browser_log', '')) < 3):
>        print >> sys.stderr, "FAIL: incorrect parameters to bcontroller\n"
> +      return -1

Should document the return codes and what they mean

@@ +243,4 @@
>  
>  if __name__ == "__main__":
> +    os._exit(main())
> +

what does os._exit buy that sys.exit does not?  Should document this e.g. as a comment.  Generally, _-prefaced methods should be avoided where possible except as last resort
Attachment #813833 - Flags: review?(jhammel) → review+
I had overlooked the xperf case.  I can add YAUH (yet another ugly hack), but I am starting to think this needs a bit more thought.  With the oncoming webrtc talos test case in bug 909524, there is a need for a pre/post test process (this could be a subprocess, or some type of routine).  

My initial thoughts are to move bcontroller.py into a setup where we do our setup, then run our process with the timeout/logwatcher/etc. then when that is done, we run the cleanup code.  In fact, this setup/cleanup could happen outside of bcontroller and managed via ttest.py.  Of course migrating to mozprocess might be a worthwhile endeavor at this point in time.

wlach, jhammel: I am open to suggestions.  Hopefully something to simplify this process and allow for a simpler way to define setup/cleanup.
Flags: needinfo?(wlachance)
Flags: needinfo?(jhammel)
I'd do as little as possible in bcontroller, as it's hard to understand/debug what it's doing. Doing the setup/cleanup inside the main talos process (e.g. ttest.py) sounds good. I agree with :jhammel that migration to mozrunner/mozprocess is ultimately what we want. We really shouldn't need to launch a seperate "bcontroller" process at all (none of the other harnesses do anything like it).
Flags: needinfo?(wlachance)
Not surprisingly, I agree with Will:  I'd move towards eliminating logic in bcontroller, eliminating launching it as a subprocess (I forget if there is any reason we don't just import it directly?), and would work on moving towards a similar model as our other test harnesses.  Moving to mozprocess/mozrunner at this point in time might be a good idea (I would advocate it, though I can't really say I can estimate all the costs).

However, this is all easy to say, and there are probably 1000 little nits that make doing so challenging.  Unless this is a Q4 goal, its hard to guess (for me, anyway) the time/necessity to put into doing this right vs. the alternatives.  In short, your plan sounds good, Joel, and a move in the right direction.
Flags: needinfo?(jhammel)
this support mozprocess.  I still need to get some additional testing on this (i.e. 200 retriggers on failing platforms, etc.), but for now this is pretty stable and green on my try push:
https://tbpl.mozilla.org/?tree=Try&rev=bf0261dd8515&showall=1

This patch adds some support for setup/cleanup scripts (needed for xperf).  In addition, this removes bcontroller, and the majority of the functionality in ffprocess_*.py.  Mobile also has a much simpler story.

Please scrutinize this in full detail.  As for secondary reviews, etc.

In the future we could cleanup more of ffprocess bits, also some of the remote management could be pulled out of ffprocess and coalesced into a single remoteProcess api which encapsulates devicemanager.
Attachment #813833 - Attachment is obsolete: true
Attachment #816017 - Flags: review?(jhammel)
fixed:
* typo in cmanager_mac.py
* added timeout to .run() command
* fixed default timeouts for various tests (i.e. no more 1 hour timeouts)
* removed some debugging messages

this is up on try server:
https://tbpl.mozilla.org/?tree=Try&rev=bf0261dd8515&showall=1

tsvgx is working great (the root cause for fixing this), but tp5 on 10.6/10.7 is having troubles, unrelated to mozprocess, but a more general issue in the last couple weeks on osx and talos/mochitest.
Attachment #816017 - Attachment is obsolete: true
Attachment #816017 - Flags: review?(jhammel)
Attachment #816512 - Flags: review?(jhammel)
Blocks: 909524
Blocks: 927312
Blocks: 924956
Comment on attachment 816512 [details] [diff] [review]
support mozprocess in talos (1.1)

Review of attachment 816512 [details] [diff] [review]:
-----------------------------------------------------------------

a somewhat wary r+.  Tons of good stuff here...a few nits....and a few things that look good but that raise my hair probably due to talosphobia and past allergies.

Good to go with nits fixed and confidence in try.  Otherwise, I'm happy to do another round of review, this time in much shorter order.

::: talos/bcontroller.py
@@ -1,2 @@
> -#!/usr/bin/env python
> -

kill it!

::: talos/cmanager.py
@@ +6,4 @@
>  
>    counterDict = {}
>  
> +  def __init__(self, process, counters=None, childProcess="plugin-container"):

I realize that this is all legacy, but it is unnecessary to pass these items to the base class if they are not used.

::: talos/cmanager_linux.py
@@ +162,5 @@
>           counters: A list of counters to monitor. Any counters whose name does
>           not match a key in 'counterDict' will be ignored.
>      """
>  
> +    CounterManager.__init__(self, process, counters)

See comment in cmanager.py

@@ +167,4 @@
>      self.childProcess = childProcess
>      self.runThread = False
>      self.pidList = []
> +    self.primaryPid = mozpid.get_pids(process)[-1]

Worth checking multiple pids?

::: talos/cmanager_mac.py
@@ +73,5 @@
>    def getCounterValue(self, counterName):
>      """Returns the last value of the counter 'counterName'"""
> +    if counterName not in self.registeredCounters:
> +        return
> +

Silent fail....sigh.  Maybe worth printing an error?

::: talos/ffprocess.py
@@ +71,3 @@
>  
>          return terminate_result
>  

I certainly like to see this gone.

::: talos/ffprocess_remote.py
@@ +255,5 @@
> +                raise talosError("missing data from remote log file")
> +
> +        # Wait out the browser closing
> +        time.sleep(browser_config['browser_wait'])
> +

This I'll leave up to your scrutiny.  Looks fine, but I haven't tested or know this path well

::: talos/ffprocess_win32.py
@@ +18,1 @@
>  

Would be nice to kill these files entirely someday.  A man can dream....

::: talos/results.py
@@ +310,4 @@
>              try:
> +               with open(filename, 'r') as f:
> +                   results_raw = f.read()
> +            except Exception, e:

should definitely be 

`except Exception, exception:`

or

`except:
    raise`

Please fix before checkin.

(You could probably just leave out the try: except clause entirely)

::: talos/talosProcess.py
@@ +10,5 @@
> +from utils import talosError
> +
> +class talosProcess(ProcessHandler):
> +    """
> +    Process handler for running peptests

peptests?

@@ +14,5 @@
> +    Process handler for running peptests
> +    """
> +    def __init__(self, cmd,
> +                       args=None, cwd=None,
> +                       env=os.environ.copy(),

default arguments shouldn't be mutable

@@ +30,5 @@
> +        if not self.logfile:
> +            return
> +
> +        if not self.results_file:
> +            self.results_file = open(self.logfile, 'w')

I'd probably open/close on each call or flush to ensure that log messages get written

@@ +40,5 @@
> +            self.results_file.close()
> +
> +    def waitForQuit(self, timeout=5):
> +        for i in range(1, timeout):
> +            if self.proc.returncode != None:

is not None

@@ +57,5 @@
> +        """
> +        When we timeout, dictate this in the log file.
> +        """
> +        if os.path.isfile(self.logfile):
> +            os.chmod(self.logfile, 0777)

pls document why

@@ +72,5 @@
> +            thread = Thread(target=self.waitForQuit)
> +            thread.setDaemon(True) # don't hang on quit
> +            thread.start()
> +        print line
> +        self.logToFile(line + "\n")

Overall looks good!

::: talos/test.py
@@ +278,4 @@
>      xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes']
>      mobile = False # too many files to run, we will hit OOM
>      filters = [["ignore_first", [5]], ['median', []]]
> +    timeout = 1800

Not sure why the timeouts are changing?

::: talos/ttest.py
@@ +355,5 @@
> +                if not browser_config['remote']:
> +                    if test_config['setup']:
> +                        # Generate bcontroller.yml for xperf
> +                        utils.GenerateTalosConfig(command_args, browser_config, test_config)
> +                        # TODO: define this script better

What does this mean?

@@ +356,5 @@
> +                    if test_config['setup']:
> +                        # Generate bcontroller.yml for xperf
> +                        utils.GenerateTalosConfig(command_args, browser_config, test_config)
> +                        # TODO: define this script better
> +                        setup = talosProcess.talosProcess(['python'] + test_config['setup'].split(' '))

Why ' '?

@@ +376,3 @@
>  
> +                    # todo: how do we throw a timeout error
> +                    # todo: ctrl+c doesn't close the browser windows

Please file on landing?

::: talos/utils.py
@@ +123,3 @@
>  def is_running(pid, psarg='axwww'):
>    """returns if a pid is running"""
> +  return bool([i for i in mozpid.ps(psarg) if pid == int(i['PID'])])

So we should be careful that 1. the resulting consolidated code is best and more generic than the two different parts and; 2.(IMHO) that the resultant code should live in mozbase (specifically, e.g. mozbase.pid).  The code is close but doesn't entirely match:

``talos.utils``

http://hg.mozilla.org/build/talos/file/ed35a6799e25/talos/utils.py#l121
def _parse_ps():
    """parse the output of the ps command"""
    # -> where are the tests?

def ps(arg='axwww'):
   """
   python front-end to `ps`
   http://en.wikipedia.org/wiki/Ps_%28Unix%29
   """

def is_running(pid, psarg='axwww'):
  """returns if a pid is running"""

def running_processes(name, psarg='axwww', defunct=False):
  """
  returns a list of 2-tuples of running processes:
  (pid, ['path/to/executable', 'args', '...'])
  with the executable named `name`.
  - defunct: whether to return defunct processes
  """
----

``mozprocess.pid``

# https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L21
def ps(arg=psarg):
    """
    python front-end to `ps`
    http://en.wikipedia.org/wiki/Ps_%28Unix%29
    returns a list of process dicts based on the `ps` header
    """

# https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L42
def running_processes(name, psarg=psarg, defunct=True):
    """
    returns a list of
    {'PID': PID of process (int)
     'command': command line of process (list)}
     with the executable named `name`.
     - defunct: whether to return defunct processes
    """

# https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L73
def get_pids(name):
    """Get all the pids matching name"""

This doesn't have to be done here, but should be done as a follow up.

@@ +245,4 @@
>      except ValueError:
>          return value
>  
> +def GenerateTalosConfig(command_line, browser_config, test_config, pid=None):

I'd rather have this broken out into a separate file.  Just a nit, not a blocker.

::: talos/xtalos/xtalos.py
@@ +45,4 @@
>  
>          self.add_option("-e", "--etl_filename", dest="etl_filename",
>                          help = "Name of the .etl file to work with. Defaults to 'output.etl'")
> +        defaults["etl_filename"] = "test.etl"

changes this file herein seem unrelated but fine (AFAIK)
Attachment #816512 - Flags: review?(jhammel) → review+
Depends on: 928328
Depends on: 928326
Depends on: 928324
Depends on: 928323
Depends on: 925408
Comment on attachment 816512 [details] [diff] [review]
support mozprocess in talos (1.1)

Review of attachment 816512 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for the review, I have made changes requested, filed bugs as needed and answered your questions below.

pushed to try for final verification:
https://tbpl.mozilla.org/?tree=Try&rev=44b6b2d9aa10

::: talos/cmanager_linux.py
@@ +167,4 @@
>      self.childProcess = childProcess
>      self.runThread = False
>      self.pidList = []
> +    self.primaryPid = mozpid.get_pids(process)[-1]

as the variably is named, primaryPid this is a single master process pid.  I kept parity with the existing code.  As for a future enhancement I am not sure if the firefox process will have multiple pids, as it stands today child processes will be a different name.

::: talos/test.py
@@ +278,4 @@
>      xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes']
>      mobile = False # too many files to run, we will hit OOM
>      filters = [["ignore_first", [5]], ['median', []]]
> +    timeout = 1800

timeouts are changing because they are way too high.

::: talos/ttest.py
@@ +355,5 @@
> +                if not browser_config['remote']:
> +                    if test_config['setup']:
> +                        # Generate bcontroller.yml for xperf
> +                        utils.GenerateTalosConfig(command_args, browser_config, test_config)
> +                        # TODO: define this script better

filed bug 928323 for better documentation on test integration points.

@@ +356,5 @@
> +                    if test_config['setup']:
> +                        # Generate bcontroller.yml for xperf
> +                        utils.GenerateTalosConfig(command_args, browser_config, test_config)
> +                        # TODO: define this script better
> +                        setup = talosProcess.talosProcess(['python'] + test_config['setup'].split(' '))

bad coding on my part

@@ +376,3 @@
>  
> +                    # todo: how do we throw a timeout error
> +                    # todo: ctrl+c doesn't close the browser windows

filed bug 928324

::: talos/utils.py
@@ +123,3 @@
>  def is_running(pid, psarg='axwww'):
>    """returns if a pid is running"""
> +  return bool([i for i in mozpid.ps(psarg) if pid == int(i['PID'])])

filed bug 928328 to track the verification and unification of process listing.

@@ +245,4 @@
>      except ValueError:
>          return value
>  
> +def GenerateTalosConfig(command_line, browser_config, test_config, pid=None):

filed bug 928326 to move this into a separate file.
https://hg.mozilla.org/build/talos/rev/18ede41362ab
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 928370
this was backed out of inbound for breaking all mobile tests.  With the changing landscape of running tests I had overlooked uploading a new talos.zip
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 819437 [details] [diff] [review]
fix a few nits to make mobile work (1.0)

Review of attachment 819437 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #819437 - Flags: review?(jhammel) → review+
Depends on: 929641
Depends on: 930439
Depends on: 930188
No longer depends on: 930188
Depends on: 931032
Depends on: 931342
Depends on: 932283
Depends on: 932389
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: