Closed Bug 923770 Opened 11 years ago Closed 11 years ago

Firefox doesn't close properly in some cases, nor does bcontroller, causing timout exceeded messages

Categories

(Testing :: Talos, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(2 files, 2 obsolete files)

force hung browser/bcontroller.py to quit (1.0) 11 years ago Joel Maher ( :jmaher ) (UTC -8) 2.44 KB, patch	k0scist : review+	Details \| Diff \| Splinter Review
support mozprocess in talos (1.0) 11 years ago Joel Maher ( :jmaher ) (UTC -8) 55.89 KB, patch		Details \| Diff \| Splinter Review
support mozprocess in talos (1.1) 11 years ago Joel Maher ( :jmaher ) (UTC -8) 56.90 KB, patch	k0scist : review+	Details \| Diff \| Splinter Review
fix a few nits to make mobile work (1.0) 11 years ago Joel Maher ( :jmaher ) (UTC -8) 4.67 KB, patch	k0scist : review+	Details \| Diff \| Splinter Review

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Description

•

11 years ago

My favorite bug 798219 has moved up in priority. In fact, this bug is next to impossible to reproduce locally. By doing try server debugging, I found that we were going through our code to quit, but we were not actually terminating the browser. While this is a problem in and of itself, there is yet another problem: after finding the end of the test (via log file parsing, not process termination), sys.exit() for bcontroller failed to terminate the python process. We ended up with a timeout in ttest.py (the launch of bcontroller.py.) So given two problems, they must be related somehow. I have a solution which appears to work around this problem although I am not understanding the problem fully. It is probably true that there is some correlation between the inability to terminate firefox and bcontroller.py. I suspect a bug in python or the operating system.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 1

•

11 years ago

Attached patch force hung browser/bcontroller.py to quit (1.0) (obsolete) — Details — Splinter Review

I am open to other feedback if there is a better way to solve this bug. Really this is a first pass at solving the problem- possibly an OK way to do it. This does 2 things: 1) if we detect no activity in the log file (already in place), talos will check for our __startTimestamp*__endTimestamp message that all tests put at the end of the log file. If it is found, we terminate nicely 2) use os._exit() instead of sys.exit()

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #813833 - Flags: review?(jhammel)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 2

•

11 years ago

for the record, this doesn't solve all the timeouts, but it does greatly reduce them. It appears we still fail to complete the test in many instances and then timeout.

Comment 3

•

11 years ago

Comment on attachment 813833 [details] [diff] [review] force hung browser/bcontroller.py to quit (1.0) Review of attachment 813833 [details] [diff] [review]: ----------------------------------------------------------------- Outside of nits, looks fine. Should think of moving to mozrunner/mozprocess at some point of time. ::: talos/bcontroller.py @@ +163,5 @@ > def run(self): > self.bwaiter = BrowserWaiter(self.remoteProcess, **self.options) > noise = 0 > prev_size = 0 > + endre = re.compile('.*__startTimestamp.*__endTimestamp.*', re.MULTILINE|re.DOTALL) This may want to live outside of method scope to avoid re compiling multiple times. Bonus for making this a utility. FWIW, regexs may be an overkill since e.g. str.index may be used to find these stamps. @@ +193,5 @@ > + if os.path.isfile(self.browser_log): > + with open(self.browser_log, 'r') as bfile: > + bdata = bfile.read() > + if endre.match(bdata): > + noise = -1 Why -1? @@ +200,3 @@ > > results_file = open(self.browser_log, "a") > + if self.bwaiter.getReturn() != 0 and noise >= 0: #the browser shutdown, but not cleanly no need for `!= 0` @@ +235,4 @@ > options.get('browser_wait', -1) <= 0 or \ > len(options.get('browser_log', '')) < 3): > print >> sys.stderr, "FAIL: incorrect parameters to bcontroller\n" > + return -1 Should document the return codes and what they mean @@ +243,4 @@ > > if __name__ == "__main__": > + os._exit(main()) > + what does os._exit buy that sys.exit does not? Should document this e.g. as a comment. Generally, _-prefaced methods should be avoided where possible except as last resort

Attachment #813833 - Flags: review?(jhammel) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 4

•

11 years ago

I had overlooked the xperf case. I can add YAUH (yet another ugly hack), but I am starting to think this needs a bit more thought. With the oncoming webrtc talos test case in bug 909524, there is a need for a pre/post test process (this could be a subprocess, or some type of routine). My initial thoughts are to move bcontroller.py into a setup where we do our setup, then run our process with the timeout/logwatcher/etc. then when that is done, we run the cleanup code. In fact, this setup/cleanup could happen outside of bcontroller and managed via ttest.py. Of course migrating to mozprocess might be a worthwhile endeavor at this point in time. wlach, jhammel: I am open to suggestions. Hopefully something to simplify this process and allow for a simpler way to define setup/cleanup.

Flags: needinfo?(wlachance)

Flags: needinfo?(jhammel)

William Lachance (:wlach)

Comment 5

•

11 years ago

I'd do as little as possible in bcontroller, as it's hard to understand/debug what it's doing. Doing the setup/cleanup inside the main talos process (e.g. ttest.py) sounds good. I agree with :jhammel that migration to mozrunner/mozprocess is ultimately what we want. We really shouldn't need to launch a seperate "bcontroller" process at all (none of the other harnesses do anything like it).

Flags: needinfo?(wlachance)

Comment 6

•

11 years ago

Not surprisingly, I agree with Will: I'd move towards eliminating logic in bcontroller, eliminating launching it as a subprocess (I forget if there is any reason we don't just import it directly?), and would work on moving towards a similar model as our other test harnesses. Moving to mozprocess/mozrunner at this point in time might be a good idea (I would advocate it, though I can't really say I can estimate all the costs). However, this is all easy to say, and there are probably 1000 little nits that make doing so challenging. Unless this is a Q4 goal, its hard to guess (for me, anyway) the time/necessity to put into doing this right vs. the alternatives. In short, your plan sounds good, Joel, and a move in the right direction.

Flags: needinfo?(jhammel)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 7

•

11 years ago

Attached patch support mozprocess in talos (1.0) (obsolete) — Details — Splinter Review

this support mozprocess. I still need to get some additional testing on this (i.e. 200 retriggers on failing platforms, etc.), but for now this is pretty stable and green on my try push: https://tbpl.mozilla.org/?tree=Try&rev=bf0261dd8515&showall=1 This patch adds some support for setup/cleanup scripts (needed for xperf). In addition, this removes bcontroller, and the majority of the functionality in ffprocess_*.py. Mobile also has a much simpler story. Please scrutinize this in full detail. As for secondary reviews, etc. In the future we could cleanup more of ffprocess bits, also some of the remote management could be pulled out of ffprocess and coalesced into a single remoteProcess api which encapsulates devicemanager.

Attachment #813833 - Attachment is obsolete: true

Attachment #816017 - Flags: review?(jhammel)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 8

•

11 years ago

Attached patch support mozprocess in talos (1.1) — Details — Splinter Review

fixed: * typo in cmanager_mac.py * added timeout to .run() command * fixed default timeouts for various tests (i.e. no more 1 hour timeouts) * removed some debugging messages this is up on try server: https://tbpl.mozilla.org/?tree=Try&rev=bf0261dd8515&showall=1 tsvgx is working great (the root cause for fixing this), but tp5 on 10.6/10.7 is having troubles, unrelated to mozprocess, but a more general issue in the last couple weeks on osx and talos/mochitest.

Attachment #816017 - Attachment is obsolete: true

Attachment #816017 - Flags: review?(jhammel)

Attachment #816512 - Flags: review?(jhammel)

Maire Reavy [:mreavy]

Updated

•

11 years ago

Blocks: 909524

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Blocks: 927312

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Blocks: 924956

Comment 10

•

11 years ago

Comment on attachment 816512 [details] [diff] [review] support mozprocess in talos (1.1) Review of attachment 816512 [details] [diff] [review]: ----------------------------------------------------------------- a somewhat wary r+. Tons of good stuff here...a few nits....and a few things that look good but that raise my hair probably due to talosphobia and past allergies. Good to go with nits fixed and confidence in try. Otherwise, I'm happy to do another round of review, this time in much shorter order. ::: talos/bcontroller.py @@ -1,2 @@ > -#!/usr/bin/env python > - kill it! ::: talos/cmanager.py @@ +6,4 @@ > > counterDict = {} > > + def __init__(self, process, counters=None, childProcess="plugin-container"): I realize that this is all legacy, but it is unnecessary to pass these items to the base class if they are not used. ::: talos/cmanager_linux.py @@ +162,5 @@ > counters: A list of counters to monitor. Any counters whose name does > not match a key in 'counterDict' will be ignored. > """ > > + CounterManager.__init__(self, process, counters) See comment in cmanager.py @@ +167,4 @@ > self.childProcess = childProcess > self.runThread = False > self.pidList = [] > + self.primaryPid = mozpid.get_pids(process)[-1] Worth checking multiple pids? ::: talos/cmanager_mac.py @@ +73,5 @@ > def getCounterValue(self, counterName): > """Returns the last value of the counter 'counterName'""" > + if counterName not in self.registeredCounters: > + return > + Silent fail....sigh. Maybe worth printing an error? ::: talos/ffprocess.py @@ +71,3 @@ > > return terminate_result > I certainly like to see this gone. ::: talos/ffprocess_remote.py @@ +255,5 @@ > + raise talosError("missing data from remote log file") > + > + # Wait out the browser closing > + time.sleep(browser_config['browser_wait']) > + This I'll leave up to your scrutiny. Looks fine, but I haven't tested or know this path well ::: talos/ffprocess_win32.py @@ +18,1 @@ > Would be nice to kill these files entirely someday. A man can dream.... ::: talos/results.py @@ +310,4 @@ > try: > + with open(filename, 'r') as f: > + results_raw = f.read() > + except Exception, e: should definitely be `except Exception, exception:` or `except: raise` Please fix before checkin. (You could probably just leave out the try: except clause entirely) ::: talos/talosProcess.py @@ +10,5 @@ > +from utils import talosError > + > +class talosProcess(ProcessHandler): > + """ > + Process handler for running peptests peptests? @@ +14,5 @@ > + Process handler for running peptests > + """ > + def __init__(self, cmd, > + args=None, cwd=None, > + env=os.environ.copy(), default arguments shouldn't be mutable @@ +30,5 @@ > + if not self.logfile: > + return > + > + if not self.results_file: > + self.results_file = open(self.logfile, 'w') I'd probably open/close on each call or flush to ensure that log messages get written @@ +40,5 @@ > + self.results_file.close() > + > + def waitForQuit(self, timeout=5): > + for i in range(1, timeout): > + if self.proc.returncode != None: is not None @@ +57,5 @@ > + """ > + When we timeout, dictate this in the log file. > + """ > + if os.path.isfile(self.logfile): > + os.chmod(self.logfile, 0777) pls document why @@ +72,5 @@ > + thread = Thread(target=self.waitForQuit) > + thread.setDaemon(True) # don't hang on quit > + thread.start() > + print line > + self.logToFile(line + "\n") Overall looks good! ::: talos/test.py @@ +278,4 @@ > xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes'] > mobile = False # too many files to run, we will hit OOM > filters = [["ignore_first", [5]], ['median', []]] > + timeout = 1800 Not sure why the timeouts are changing? ::: talos/ttest.py @@ +355,5 @@ > + if not browser_config['remote']: > + if test_config['setup']: > + # Generate bcontroller.yml for xperf > + utils.GenerateTalosConfig(command_args, browser_config, test_config) > + # TODO: define this script better What does this mean? @@ +356,5 @@ > + if test_config['setup']: > + # Generate bcontroller.yml for xperf > + utils.GenerateTalosConfig(command_args, browser_config, test_config) > + # TODO: define this script better > + setup = talosProcess.talosProcess(['python'] + test_config['setup'].split(' ')) Why ' '? @@ +376,3 @@ > > + # todo: how do we throw a timeout error > + # todo: ctrl+c doesn't close the browser windows Please file on landing? ::: talos/utils.py @@ +123,3 @@ > def is_running(pid, psarg='axwww'): > """returns if a pid is running""" > + return bool([i for i in mozpid.ps(psarg) if pid == int(i['PID'])]) So we should be careful that 1. the resulting consolidated code is best and more generic than the two different parts and; 2.(IMHO) that the resultant code should live in mozbase (specifically, e.g. mozbase.pid). The code is close but doesn't entirely match: ``talos.utils`` http://hg.mozilla.org/build/talos/file/ed35a6799e25/talos/utils.py#l121 def _parse_ps(): """parse the output of the ps command""" # -> where are the tests? def ps(arg='axwww'): """ python front-end to `ps` http://en.wikipedia.org/wiki/Ps_%28Unix%29 """ def is_running(pid, psarg='axwww'): """returns if a pid is running""" def running_processes(name, psarg='axwww', defunct=False): """ returns a list of 2-tuples of running processes: (pid, ['path/to/executable', 'args', '...']) with the executable named `name`. - defunct: whether to return defunct processes """ ---- ``mozprocess.pid`` # https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L21 def ps(arg=psarg): """ python front-end to `ps` http://en.wikipedia.org/wiki/Ps_%28Unix%29 returns a list of process dicts based on the `ps` header """ # https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L42 def running_processes(name, psarg=psarg, defunct=True): """ returns a list of {'PID': PID of process (int) 'command': command line of process (list)} with the executable named `name`. - defunct: whether to return defunct processes """ # https://github.com/mozilla/mozbase/blob/master/mozprocess/mozprocess/pid.py#L73 def get_pids(name): """Get all the pids matching name""" This doesn't have to be done here, but should be done as a follow up. @@ +245,4 @@ > except ValueError: > return value > > +def GenerateTalosConfig(command_line, browser_config, test_config, pid=None): I'd rather have this broken out into a separate file. Just a nit, not a blocker. ::: talos/xtalos/xtalos.py @@ +45,4 @@ > > self.add_option("-e", "--etl_filename", dest="etl_filename", > help = "Name of the .etl file to work with. Defaults to 'output.etl'") > + defaults["etl_filename"] = "test.etl" changes this file herein seem unrelated but fine (AFAIK)

Attachment #816512 - Flags: review?(jhammel) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 928328

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 928326

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 928324

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 928323

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 925408

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 11

•

11 years ago

Comment on attachment 816512 [details] [diff] [review] support mozprocess in talos (1.1) Review of attachment 816512 [details] [diff] [review]: ----------------------------------------------------------------- thanks for the review, I have made changes requested, filed bugs as needed and answered your questions below. pushed to try for final verification: https://tbpl.mozilla.org/?tree=Try&rev=44b6b2d9aa10 ::: talos/cmanager_linux.py @@ +167,4 @@ > self.childProcess = childProcess > self.runThread = False > self.pidList = [] > + self.primaryPid = mozpid.get_pids(process)[-1] as the variably is named, primaryPid this is a single master process pid. I kept parity with the existing code. As for a future enhancement I am not sure if the firefox process will have multiple pids, as it stands today child processes will be a different name. ::: talos/test.py @@ +278,4 @@ > xperf_counters = ['main_startup_fileio', 'main_startup_netio', 'main_normal_fileio', 'main_normal_netio', 'nonmain_startup_fileio', 'nonmain_normal_fileio', 'nonmain_normal_netio', 'mainthread_readcount', 'mainthread_readbytes', 'mainthread_writecount', 'mainthread_writebytes'] > mobile = False # too many files to run, we will hit OOM > filters = [["ignore_first", [5]], ['median', []]] > + timeout = 1800 timeouts are changing because they are way too high. ::: talos/ttest.py @@ +355,5 @@ > + if not browser_config['remote']: > + if test_config['setup']: > + # Generate bcontroller.yml for xperf > + utils.GenerateTalosConfig(command_args, browser_config, test_config) > + # TODO: define this script better filed bug 928323 for better documentation on test integration points. @@ +356,5 @@ > + if test_config['setup']: > + # Generate bcontroller.yml for xperf > + utils.GenerateTalosConfig(command_args, browser_config, test_config) > + # TODO: define this script better > + setup = talosProcess.talosProcess(['python'] + test_config['setup'].split(' ')) bad coding on my part @@ +376,3 @@ > > + # todo: how do we throw a timeout error > + # todo: ctrl+c doesn't close the browser windows filed bug 928324 ::: talos/utils.py @@ +123,3 @@ > def is_running(pid, psarg='axwww'): > """returns if a pid is running""" > + return bool([i for i in mozpid.ps(psarg) if pid == int(i['PID'])]) filed bug 928328 to track the verification and unification of process listing. @@ +245,4 @@ > except ValueError: > return value > > +def GenerateTalosConfig(command_line, browser_config, test_config, pid=None): filed bug 928326 to move this into a separate file.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 12

•

11 years ago

https://hg.mozilla.org/build/talos/rev/18ede41362ab

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 928370

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 13

•

11 years ago

this was backed out of inbound for breaking all mobile tests. With the changing landscape of running tests I had overlooked uploading a new talos.zip

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 14

•

11 years ago

Attached patch fix a few nits to make mobile work (1.0) — Details — Splinter Review

Attachment #819437 - Flags: review?(jhammel)

Comment 15

•

11 years ago

Comment on attachment 819437 [details] [diff] [review] fix a few nits to make mobile work (1.0) Review of attachment 819437 [details] [diff] [review]: ----------------------------------------------------------------- lgtm

Attachment #819437 - Flags: review?(jhammel) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 16

•

11 years ago

http://hg.mozilla.org/build/talos/rev/49fe0b4c3014

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 929641

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 930439

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 930188

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

No longer depends on: 930188

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 931032

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 931342

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 932283

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Depends on: 932389

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Status: REOPENED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.