Closed Bug 1143727 Opened 9 years ago Closed 9 years ago

talos subclasses mozprocess and write to a log file - it doesn't need to

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: parkouss)

References

Details

Attachments

(1 file, 7 obsolete files)

in the talos repository, we have a file TalosProcess.py which is a subclass of mozprocess.  There is not much logic in there and we don't need to subclass mozprocess.

this current setup writes to a log file, then when the process is done we read the log file and parse the results.

A few reasons for the logfile:
1) android (already does things much differently)
2) xperf (again, much different inputs/outputs)
3) investigating things locally

Getting rid of the logfile might be interesting and probably beneficial.  Getting rid of the subprocess should be done.  We can take whatever logic is in there (timeout, outputhandler) and put them in ancillary functions in ttest.py.

This came out of an irc discussion had between :jmaher and :wlach while discussing bug 1143094 (which landed).
:parkouss, do let me know if this bug interests you!
Flags: needinfo?(j.parkouss)
Sure, I look at it. :)
Assignee: nobody → j.parkouss
Flags: needinfo?(j.parkouss)
So, this is my first attempt on this.

I did not take care of getting rid of the log file. This can be done in a followup but for now I'm a bit lost when this log is useful and when it's not, so I just let things the way it was.

Some comments here:

- In ffsetup.py file, we used TalosProcess with a log file. I made the assumption that the log file here was only to save/exploit output, but not to kill the process when a given line is encountered. Tell me if it is ok. :)

- I removed the time.sleep(5) after browser exit. It seems that it was here because of mozprocess not writing all output in the file. I think it was a bug resolved with bug 794984. Some more tests on this would be good though. :)

Running Talos locally seems to work fine.
Attachment #8578583 - Flags: review?(jmaher)
Comment on attachment 8578583 [details] [diff] [review]
talos subclasses mozprocess - it doesn't need to

Review of attachment 8578583 [details] [diff] [review]:
-----------------------------------------------------------------

ffsetup- we don't need to kill the browser based on failure messages!
sleep(5) - good idea to get rid of it- most likely it was related to the file write.  I will see if xperf has issues with it.

::: talos/ttest.py
@@ +65,5 @@
> +                           "__endBeforeLaunchTimestamp\n" % self.firstTime)
> +        self.logfile.write("__startAfterTerminationTimestamp%d"
> +                           "__endAfterTerminationTimestamp\n"
> +                           % (int(time.time()) * 1000))
> +        

nit: remove tab/spaces on blank line

@@ +523,5 @@
> +                            self.cm = self.CounterManager(browser_config['process'], self.counters)
> +                            self.counter_results = dict([(counter, []) for counter in self.counters])
> +                            cmthread = Thread(target=self.collectCounters)
> +                            cmthread.setDaemon(True) # don't hang on quit
> +                            cmthread.start()

maybe these counters could be outside of the "with open(...)" block

@@ +534,5 @@
> +
> +                        if browser.didTimeout:
> +                            logfile.write("\n__FAILbrowser frozen__FAIL\n")
> +                            raise TalosError("timeout")
> +                        

nit: remove whitespace from blank line.
Attachment #8578583 - Flags: review?(jmaher) → review+
Ok, thanks for the review!

I fixed the nits and moved the block of counters out of the with open() block.

I ran again the tests locally, and this still seems to work fine.

Joel, maybe you can have a quick look again at the patch to be sure I made no mistake. :)
Attachment #8578583 - Attachment is obsolete: true
Attachment #8578605 - Flags: review?(jmaher)
Attachment #8578605 - Attachment is patch: true
Attachment #8578605 - Flags: review?(jmaher) → review+
this looks good!  Lets leave this open as we move into the next adventure of the logfile stuff :)
Hello,

I really doubt that the output log file can be generated with the ffprocess_remote.launchProcess function (https://github.com/mozilla/build-talos/blob/master/talos/ffprocess_remote.py#L113)

I wrote a patch that remove any previous local log file before we execute those functions. I expect that TalosError will be raised:

https://github.com/mozilla/build-talos/blob/master/talos/ffsetup.py#L140
or
https://github.com/mozilla/build-talos/blob/master/talos/ttest.py#L600

In case browser_config['remote'] evaluates to True. If that's not the case, well there is some black magic that I don't understand :)

I suspect that in case of remote, we reuse a local output file, and so we may not test something relevant.

Joel, if that patch makes sense to you as a test, could you try it on Talos ? I don't want to merge this, I just want to understand how that works. :)
In bug 698898, Will attempted to use mozrunner for local and remote devices. I wonder if local is still needed, but for remote it could be a nice thing to have, as it seems that it could remove in ffprocess_remote.py some code that looks weird (see comment 9).

The attached patch https://bug698898.bugzilla.mozilla.org/attachment.cgi?id=571146 in bug 698898 show this, but this needs to be investigated more because:

from mozrunner.remoterunner import RemoteFennecRunner

fail now with a "ImportError: No module named remoterunner" error...
we might not want to use mozrunner for talos.  One thing we are doing in the coming months is getting talos off the panda boards.  This might mean:
* running on emulators
* running on the autophone network
* who knows, to be determined

There is a chance that the remote stuff might not be around this summer, or it could change in scope/functionality.  In the case we are not using talos for Android (i.e. all on autophone), we can remove a bunch of code in talos, likewise in some hybrid approach, we might be able to clean up some code.

I pushed the patch from here and ran some android tests on try:
tpn: http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/jmaher@mozilla.com-314726b1d21d/try-android-api-11/try_panda_android_test-remote-tp4m_nochrome-bm89-tests1-panda-build60.txt.gz
tsvg: http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/jmaher@mozilla.com-314726b1d21d/try-android-api-11/try_panda_android_test-remote-tsvgx-bm100-tests1-panda-build88.txt.gz
(In reply to Julien Pagès from comment #10)
> In bug 698898, Will attempted to use mozrunner for local and remote devices.
> I wonder if local is still needed, but for remote it could be a nice thing
> to have, as it seems that it could remove in ffprocess_remote.py some code
> that looks weird (see comment 9).
> 
> The attached patch
> https://bug698898.bugzilla.mozilla.org/attachment.cgi?id=571146 in bug
> 698898 show this, but this needs to be investigated more because:
> 
> from mozrunner.remoterunner import RemoteFennecRunner
> 
> fail now with a "ImportError: No module named remoterunner" error...

Yeah I'd say that patch is completely out of date. :)

If I were going to try to refactor this stuff today, I'd probably replace ffprocess_* with a locally-developed "runner" class which just provides just the abstractions Talos needs. This is the approach I ultimately took in Eideticker and it works pretty well. mozrunner is a pretty complex beast and I think it would be quite hard to port Talos to use it from scratch (and even then it wouldn't support Android).
Another patch test, still trying to understand how this works. I removed the outfile argument in RemoteProcess.launchProcess method (I still suspect this does not works) and added a log to see the complete remote command passed.
Attachment #8585202 - Attachment is obsolete: true
confusion resolved here- we set the filename as a preference talos.logfile, and when firefox loads up the extensions we load looks for talos.logfile as a preference and we write to the file :)
So, this first patch only removes useless and tricky stuff, and add a comment. Hopefully we won't see black magic around this again. :)
Attachment #8585550 - Attachment is obsolete: true
Attachment #8585647 - Flags: review?(jmaher)
Attachment #8585647 - Flags: review?(jmaher) → review+
This second patch (note it needs to be applied after the first one) tries to remove the needs for the browser log file.

talos runs fine locally.
Attachment #8585650 - Flags: review?(jmaher)
Comment on attachment 8585650 [details] [diff] [review]
Remove the need for the browser log file.

Review of attachment 8585650 [details] [diff] [review]:
-----------------------------------------------------------------

this is nice- I need to test it, not 100% sure about the xperf side of the equation :)
Attachment #8585650 - Flags: review?(jmaher) → review+
I had to back this set of patches out:
http://hg.mozilla.org/build/talos/rev/16e23a6dde72

On windows 7 (non xperf) we have a 30-60% failure rate with errors looking like this:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-79e64c3f3eaa/try-win32/try_win7-ix_test-chromez-bm112-tests1-windows-build105.txt.gz

07:49:59     INFO -  JavaScript error: resource:///modules/WebappManager.jsm, line 48: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]
07:49:59    ERROR -  JavaScript error: resource:///modules/CustomizableUI.jsm, line 1571: TypeError: aWindowPalette is undefined
07:50:01     INFO -  DEBUG : Terminating: 1652
07:50:01     INFO -  Failed tresize:
07:50:01     INFO -  		Stopped Wed, 01 Apr 2015 07:50:01
07:50:01    ERROR -  Traceback (most recent call last):
07:50:01     INFO -    File "C:\slave\test\build\venv\lib\site-packages\talos\run_tests.py", line 249, in run_tests
07:50:01     INFO -      talos_results.add(mytest.runTest(browser_config, test))
07:50:01     INFO -    File "C:\slave\test\build\venv\lib\site-packages\talos\ttest.py", line 369, in runTest
07:50:01     INFO -      self.initializeProfile(profile_dir, browser_config)
07:50:01     INFO -    File "C:\slave\test\build\venv\lib\site-packages\talos\ttest.py", line 169, in initializeProfile
07:50:01     INFO -      raise TalosError("browser failed to close after being initialized")
07:50:01     INFO -  TalosError: browser failed to close after being initialized
07:50:01    ERROR - Return code: 2
07:50:01    ERROR - # TBPL FAILURE #

this is only on windows 7 where this is a problem.

I traced it down to this as the patch where the root cause was introduced:
http://hg.mozilla.org/build/talos/rev/7c6a5e081ed7
So, the delicate part with TalosProcess is that it is waiting for '__enTimestamp' in the browser output, and if it find that it wait in a thread until the browser process is automatically finished (5 seconds max) - and if the browser is still alive it kills it (no error, execution continue).

So I think we should clarify what we need here. Here is a rough grep:

grep __endTimestamp `find talos -name '*.py' -o -name '*.js'`
talos/results.py:        ('startTime', ('__startTimestamp', '__endTimestamp')),
talos/TalosProcess.py:    After the browser prints __endTimestamp, we give it wait_for_quit_timeout
talos/TalosProcess.py:        if line.find('__endTimestamp') != -1:
talos/pageloader/chrome/report.js:  report += "__startTimestamp" + now + "__endTimestamp\n"; //timestamp for determning shutdown time, used by talos
talos/startup_test/media/html/media_tests.js:    dump("__startTimestamp" + results["BEGIN_TIME_STAMP"] + "__endTimestamp\n");
talos/startup_test/sessionrestore/main.js:            "__endTimestamp\n\n");
talos/startup_test/tresize/addon/content/tresize.js:    retVal.push('__startTimestamp' + Date.now() + '__endTimestamp\n');
talos/test.py:                         # provided by __startTimestamp/__endTimestamp).
talos/test.py:                         # (__startTimestamp/__endTimestamp is still


So what do talos needs to do when it sees '__enTimestamp' ? What is the meaning of that flag ? (not what Talos do with it, but what do we want to achieve) ?

Well it is not really clear to me what Talos should do when it sees that. I wonder if we could rewrite that functionality (- or just remove it ?) so we won't have to do this tricky thing, but only something like:

browser = ProcessHandler(command, timeout=TIMEOUT, ...)
browser.wait()
if browser.didTimeout:
    raise TalosError(...)
Flags: needinfo?(jmaher)
we parse the output to determine when the test is done, so __endTimestamp acts as a way to do that.  When that is in the output we have completed the test, and we don't care about what else happens after that in the output or how the browser is closed down.

When we don't see __endTimestamp, then we assume the browser didn't finish properly.  This gets odd when we have the browser writing to a log file (sometimes buffered) and we read from it in the browser output process.

does that help explain it?
Flags: needinfo?(jmaher)
Attached patch 1143727.patch (obsolete) — Splinter Review
Yes thanks, this helps. :)

Well, I reworked on this, made the browser_log obsolete and tried to simplify things around the code needed to run the browser.

Note there is no more subclassing, and I removed that time.sleep(5). This time I am pretty confident - but this needs to be tested quite a lot on try, specially with windows jobs (as usual ;)). :)
Attachment #8578605 - Attachment is obsolete: true
Attachment #8585647 - Attachment is obsolete: true
Attachment #8585650 - Attachment is obsolete: true
Attachment #8647417 - Flags: review?(jmaher)
Re-run jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1b8cb5d5af2 (5 times more for each job)
So far, looks pretty good!

But graph server seems a bit broken. When I look at the tp failure for win8, I can see:

> No test_name called 'tp5o' can be found
>   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 293, in handleRequest
>     metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream)
>   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__
>     self.doDatabaseThings(databaseCursor)
>   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 106, in doDatabaseThings
>     raise DatabaseException("No test_name called '%s' can be found" % self.test_name)

(from http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/j.parkouss@gmail.com-e1b8cb5d5af2/try-win64/try_win8_64_test-tp5o-bm110-tests1-windows-build350.txt.gz) but this is only for one job, the other ones are good...

Also, tp on linux64:

> to determine geomean from 'test_run_values' for  50880468 - local variable 'values' referenced before assignment
>   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 295, in handleRequest
>     geomean = valuesReader(databaseCursor, databaseModule, inputStream, metadata)
>   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 238, in valuesReader
>     raise DatabaseException("to determine geomean from 'test_run_values' for  %s - %s" % (metadata.test_run_id, str(x)))

this time all builds failed this way.

And this is all reported with the "talos.utils.TalosError: Graph server unreachable (5 attempts)"

It really seems to me that these are issues from graphserver, looking at the log. Or do you think it is related to my changes in some way that I can't understand ?
the real error is here:
03:05:32     INFO -  No results collected for: tp5o_xres:


when we upload no data for the counter (or a test), graph server throws an error.  I am not sure why xres is causing troubles, could this be part of the counters work?
ah, nice catch. Yes, maybe. While working on the counters, I remember I was wondering if I should collect a last time after we stop collecting.

https://bitbucket.org/parkouss/talos/commits/7fe8298e0aa272d8f241dee6b307ad35664fee62#Ltalos/cmanager.pyT68

I.E., do what we do in that while loop one more time after the loop end. Hmm, I guess we should try to reproduce first, then apply a patch for that, and try to not reproduce this time.

Does that make sense ?

Is this also the same kind of error for linux failures above ?
I re-triggered a bunch of tp jobs with that patch:

https://bitbucket.org/parkouss/talos/commits/259a040bea20a0a95ef9e1ee00939b592b1a4422

No issues on Win8 - so this should do the trick.

But linux is failing the same way (not intermittent, though).

:jmaher, thoughts ?
linux tp on inbound has data for xres, I suspect something you have changed is causing it to fail.  I believe the counters are silent failures as they run in a different thread.

let me look at this locally since I am working on counters :)  If you have ideas for debugging it would be good to try that out.
Comment on attachment 8647417 [details] [diff] [review]
1143727.patch

Review of attachment 8647417 [details] [diff] [review]:
-----------------------------------------------------------------

I wish I had some nits, lets fix the xrestop counter collection.

I think we will be ok with default stuff, but what about e10s and pgo?  The testing matrix is growing :)
Attachment #8647417 - Flags: review?(jmaher) → review+
Attached patch 1143727.patchSplinter Review
Bug found, fixed in this patch (thanks Joel!).
Attachment #8647417 - Attachment is obsolete: true
Attachment #8648090 - Flags: review+
Landed in https://hg.mozilla.org/build/talos/rev/55711f8341c5.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1195291
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: