Closed Bug 1143727 Opened 10 years ago Closed 9 years ago

talos subclasses mozprocess and write to a log file - it doesn't need to

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: parkouss)

References

Details

Attachments

(1 file, 7 obsolete files)

talos subclasses mozprocess - it doesn't need to 10 years ago Julien Pagès (:parkouss) 13.39 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
talos subclasses mozprocess - it doesn't need to 10 years ago Julien Pagès (:parkouss) 13.26 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
test that browser_output log file is generated by ffprocess_remote 10 years ago Julien Pagès (:parkouss) 2.84 KB, patch		Details \| Diff \| Splinter Review
test removing outfile argument to the remove launchProcess method 10 years ago Julien Pagès (:parkouss) 3.39 KB, text/plain		Details
cleanup RemoteProcess.launchProcess 10 years ago Julien Pagès (:parkouss) 3.23 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
Remove the need for the browser log file. 10 years ago Julien Pagès (:parkouss) 13.49 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
1143727.patch 9 years ago Julien Pagès (:parkouss) 23.98 KB, patch	jmaher : review+	Details \| Diff \| Splinter Review
1143727.patch 9 years ago Julien Pagès (:parkouss) 24.11 KB, patch	parkouss : review+	Details \| Diff \| Splinter Review

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Description

•

10 years ago

in the talos repository, we have a file TalosProcess.py which is a subclass of mozprocess. There is not much logic in there and we don't need to subclass mozprocess. this current setup writes to a log file, then when the process is done we read the log file and parse the results. A few reasons for the logfile: 1) android (already does things much differently) 2) xperf (again, much different inputs/outputs) 3) investigating things locally Getting rid of the logfile might be interesting and probably beneficial. Getting rid of the subprocess should be done. We can take whatever logic is in there (timeout, outputhandler) and put them in ancillary functions in ttest.py. This came out of an irc discussion had between :jmaher and :wlach while discussing bug 1143094 (which landed).

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 1

•

10 years ago

:parkouss, do let me know if this bug interests you!

Flags: needinfo?(j.parkouss)

Julien Pagès (:parkouss)

Assignee

Comment 2

•

10 years ago

Sure, I look at it. :)

Assignee: nobody → j.parkouss

Flags: needinfo?(j.parkouss)

Julien Pagès (:parkouss)

Assignee

Comment 3

•

10 years ago

Attached patch talos subclasses mozprocess - it doesn't need to (obsolete) — Details — Splinter Review

So, this is my first attempt on this. I did not take care of getting rid of the log file. This can be done in a followup but for now I'm a bit lost when this log is useful and when it's not, so I just let things the way it was. Some comments here: - In ffsetup.py file, we used TalosProcess with a log file. I made the assumption that the log file here was only to save/exploit output, but not to kill the process when a given line is encountered. Tell me if it is ok. :) - I removed the time.sleep(5) after browser exit. It seems that it was here because of mozprocess not writing all output in the file. I think it was a bug resolved with bug 794984. Some more tests on this would be good though. :) Running Talos locally seems to work fine.

Attachment #8578583 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

•

10 years ago

Comment on attachment 8578583 [details] [diff] [review] talos subclasses mozprocess - it doesn't need to Review of attachment 8578583 [details] [diff] [review]: ----------------------------------------------------------------- ffsetup- we don't need to kill the browser based on failure messages! sleep(5) - good idea to get rid of it- most likely it was related to the file write. I will see if xperf has issues with it. ::: talos/ttest.py @@ +65,5 @@ > + "__endBeforeLaunchTimestamp\n" % self.firstTime) > + self.logfile.write("__startAfterTerminationTimestamp%d" > + "__endAfterTerminationTimestamp\n" > + % (int(time.time()) * 1000)) > + nit: remove tab/spaces on blank line @@ +523,5 @@ > + self.cm = self.CounterManager(browser_config['process'], self.counters) > + self.counter_results = dict([(counter, []) for counter in self.counters]) > + cmthread = Thread(target=self.collectCounters) > + cmthread.setDaemon(True) # don't hang on quit > + cmthread.start() maybe these counters could be outside of the "with open(...)" block @@ +534,5 @@ > + > + if browser.didTimeout: > + logfile.write("\n__FAILbrowser frozen__FAIL\n") > + raise TalosError("timeout") > + nit: remove whitespace from blank line.

Attachment #8578583 - Flags: review?(jmaher) → review+

Julien Pagès (:parkouss)

Assignee

Comment 5

•

10 years ago

Attached patch talos subclasses mozprocess - it doesn't need to (obsolete) — Details — Splinter Review

Ok, thanks for the review! I fixed the nits and moved the block of counters out of the with open() block. I ran again the tests locally, and this still seems to work fine. Joel, maybe you can have a quick look again at the patch to be sure I made no mistake. :)

Attachment #8578583 - Attachment is obsolete: true

Julien Pagès (:parkouss)

Assignee

Updated

•

10 years ago

Attachment #8578605 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

10 years ago

Attachment #8578605 - Attachment is patch: true

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

10 years ago

Attachment #8578605 - Flags: review?(jmaher) → review+

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

•

10 years ago

https://hg.mozilla.org/build/talos/rev/7c6a5e081ed7

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 7

•

10 years ago

this looks good! Lets leave this open as we move into the next adventure of the logfile stuff :)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 8

•

10 years ago

https://hg.mozilla.org/build/talos/rev/7b0e1eb947f9

Julien Pagès (:parkouss)

Assignee

Comment 9

•

10 years ago

Attached patch test that browser_output log file is generated by ffprocess_remote (obsolete) — Details — Splinter Review

Hello, I really doubt that the output log file can be generated with the ffprocess_remote.launchProcess function (https://github.com/mozilla/build-talos/blob/master/talos/ffprocess_remote.py#L113) I wrote a patch that remove any previous local log file before we execute those functions. I expect that TalosError will be raised: https://github.com/mozilla/build-talos/blob/master/talos/ffsetup.py#L140 or https://github.com/mozilla/build-talos/blob/master/talos/ttest.py#L600 In case browser_config['remote'] evaluates to True. If that's not the case, well there is some black magic that I don't understand :) I suspect that in case of remote, we reuse a local output file, and so we may not test something relevant. Joel, if that patch makes sense to you as a test, could you try it on Talos ? I don't want to merge this, I just want to understand how that works. :)

Julien Pagès (:parkouss)

Assignee

Comment 10

•

10 years ago

In bug 698898, Will attempted to use mozrunner for local and remote devices. I wonder if local is still needed, but for remote it could be a nice thing to have, as it seems that it could remove in ffprocess_remote.py some code that looks weird (see comment 9). The attached patch https://bug698898.bugzilla.mozilla.org/attachment.cgi?id=571146 in bug 698898 show this, but this needs to be investigated more because: from mozrunner.remoterunner import RemoteFennecRunner fail now with a "ImportError: No module named remoterunner" error...

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 11

•

10 years ago

we might not want to use mozrunner for talos. One thing we are doing in the coming months is getting talos off the panda boards. This might mean: * running on emulators * running on the autophone network * who knows, to be determined There is a chance that the remote stuff might not be around this summer, or it could change in scope/functionality. In the case we are not using talos for Android (i.e. all on autophone), we can remove a bunch of code in talos, likewise in some hybrid approach, we might be able to clean up some code. I pushed the patch from here and ran some android tests on try: tpn: http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/jmaher@mozilla.com-314726b1d21d/try-android-api-11/try_panda_android_test-remote-tp4m_nochrome-bm89-tests1-panda-build60.txt.gz tsvg: http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/jmaher@mozilla.com-314726b1d21d/try-android-api-11/try_panda_android_test-remote-tsvgx-bm100-tests1-panda-build88.txt.gz

William Lachance (:wlach)

Comment 12

•

10 years ago

(In reply to Julien Pagès from comment #10) > In bug 698898, Will attempted to use mozrunner for local and remote devices. > I wonder if local is still needed, but for remote it could be a nice thing > to have, as it seems that it could remove in ffprocess_remote.py some code > that looks weird (see comment 9). > > The attached patch > https://bug698898.bugzilla.mozilla.org/attachment.cgi?id=571146 in bug > 698898 show this, but this needs to be investigated more because: > > from mozrunner.remoterunner import RemoteFennecRunner > > fail now with a "ImportError: No module named remoterunner" error... Yeah I'd say that patch is completely out of date. :) If I were going to try to refactor this stuff today, I'd probably replace ffprocess_* with a locally-developed "runner" class which just provides just the abstractions Talos needs. This is the approach I ultimately took in Eideticker and it works pretty well. mozrunner is a pretty complex beast and I think it would be quite hard to port Talos to use it from scratch (and even then it wouldn't support Android).

Julien Pagès (:parkouss)

Assignee

Comment 13

•

10 years ago

Attached file test removing outfile argument to the remove launchProcess method (obsolete) — Details

Another patch test, still trying to understand how this works. I removed the outfile argument in RemoteProcess.launchProcess method (I still suspect this does not works) and added a log to see the complete remote command passed.

Attachment #8585202 - Attachment is obsolete: true

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 14

•

10 years ago

confusion resolved here- we set the filename as a preference talos.logfile, and when firefox loads up the extensions we load looks for talos.logfile as a preference and we write to the file :)

Julien Pagès (:parkouss)

Assignee

Comment 15

•

10 years ago

Attached patch cleanup RemoteProcess.launchProcess (obsolete) — Details — Splinter Review

So, this first patch only removes useless and tricky stuff, and add a comment. Hopefully we won't see black magic around this again. :)

Attachment #8585550 - Attachment is obsolete: true

Attachment #8585647 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

10 years ago

Attachment #8585647 - Flags: review?(jmaher) → review+

Julien Pagès (:parkouss)

Assignee

Comment 16

•

10 years ago

Attached patch Remove the need for the browser log file. (obsolete) — Details — Splinter Review

This second patch (note it needs to be applied after the first one) tries to remove the needs for the browser log file. talos runs fine locally.

Attachment #8585650 - Flags: review?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 17

•

10 years ago

Comment on attachment 8585650 [details] [diff] [review] Remove the need for the browser log file. Review of attachment 8585650 [details] [diff] [review]: ----------------------------------------------------------------- this is nice- I need to test it, not 100% sure about the xperf side of the equation :)

Attachment #8585650 - Flags: review?(jmaher) → review+

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 18

•

10 years ago

https://hg.mozilla.org/build/talos/rev/c81cdb00be17 https://hg.mozilla.org/build/talos/rev/1afbdcd3391a

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 19

•

10 years ago

I had to back this set of patches out: http://hg.mozilla.org/build/talos/rev/16e23a6dde72 On windows 7 (non xperf) we have a 30-60% failure rate with errors looking like this: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-79e64c3f3eaa/try-win32/try_win7-ix_test-chromez-bm112-tests1-windows-build105.txt.gz 07:49:59 INFO - JavaScript error: resource:///modules/WebappManager.jsm, line 48: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver] 07:49:59 ERROR - JavaScript error: resource:///modules/CustomizableUI.jsm, line 1571: TypeError: aWindowPalette is undefined 07:50:01 INFO - DEBUG : Terminating: 1652 07:50:01 INFO - Failed tresize: 07:50:01 INFO - Stopped Wed, 01 Apr 2015 07:50:01 07:50:01 ERROR - Traceback (most recent call last): 07:50:01 INFO - File "C:\slave\test\build\venv\lib\site-packages\talos\run_tests.py", line 249, in run_tests 07:50:01 INFO - talos_results.add(mytest.runTest(browser_config, test)) 07:50:01 INFO - File "C:\slave\test\build\venv\lib\site-packages\talos\ttest.py", line 369, in runTest 07:50:01 INFO - self.initializeProfile(profile_dir, browser_config) 07:50:01 INFO - File "C:\slave\test\build\venv\lib\site-packages\talos\ttest.py", line 169, in initializeProfile 07:50:01 INFO - raise TalosError("browser failed to close after being initialized") 07:50:01 INFO - TalosError: browser failed to close after being initialized 07:50:01 ERROR - Return code: 2 07:50:01 ERROR - # TBPL FAILURE # this is only on windows 7 where this is a problem. I traced it down to this as the patch where the root cause was introduced: http://hg.mozilla.org/build/talos/rev/7c6a5e081ed7

Julien Pagès (:parkouss)

Assignee

Comment 20

•

9 years ago

So, the delicate part with TalosProcess is that it is waiting for '__enTimestamp' in the browser output, and if it find that it wait in a thread until the browser process is automatically finished (5 seconds max) - and if the browser is still alive it kills it (no error, execution continue). So I think we should clarify what we need here. Here is a rough grep: grep __endTimestamp `find talos -name '*.py' -o -name '*.js'` talos/results.py: ('startTime', ('__startTimestamp', '__endTimestamp')), talos/TalosProcess.py: After the browser prints __endTimestamp, we give it wait_for_quit_timeout talos/TalosProcess.py: if line.find('__endTimestamp') != -1: talos/pageloader/chrome/report.js: report += "__startTimestamp" + now + "__endTimestamp\n"; //timestamp for determning shutdown time, used by talos talos/startup_test/media/html/media_tests.js: dump("__startTimestamp" + results["BEGIN_TIME_STAMP"] + "__endTimestamp\n"); talos/startup_test/sessionrestore/main.js: "__endTimestamp\n\n"); talos/startup_test/tresize/addon/content/tresize.js: retVal.push('__startTimestamp' + Date.now() + '__endTimestamp\n'); talos/test.py: # provided by __startTimestamp/__endTimestamp). talos/test.py: # (__startTimestamp/__endTimestamp is still So what do talos needs to do when it sees '__enTimestamp' ? What is the meaning of that flag ? (not what Talos do with it, but what do we want to achieve) ? Well it is not really clear to me what Talos should do when it sees that. I wonder if we could rewrite that functionality (- or just remove it ?) so we won't have to do this tricky thing, but only something like: browser = ProcessHandler(command, timeout=TIMEOUT, ...) browser.wait() if browser.didTimeout: raise TalosError(...)

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 21

•

9 years ago

we parse the output to determine when the test is done, so __endTimestamp acts as a way to do that. When that is in the output we have completed the test, and we don't care about what else happens after that in the output or how the browser is closed down. When we don't see __endTimestamp, then we assume the browser didn't finish properly. This gets odd when we have the browser writing to a log file (sometimes buffered) and we read from it in the browser output process. does that help explain it?

Flags: needinfo?(jmaher)

Julien Pagès (:parkouss)

Assignee

Comment 22

•

9 years ago

Attached patch 1143727.patch (obsolete) — Details — Splinter Review

Yes thanks, this helps. :) Well, I reworked on this, made the browser_log obsolete and tried to simplify things around the code needed to run the browser. Note there is no more subclassing, and I removed that time.sleep(5). This time I am pretty confident - but this needs to be tested quite a lot on try, specially with windows jobs (as usual ;)). :)

Attachment #8578605 - Attachment is obsolete: true

Attachment #8585647 - Attachment is obsolete: true

Attachment #8585650 - Attachment is obsolete: true

Attachment #8647417 - Flags: review?(jmaher)

Julien Pagès (:parkouss)

Assignee

Comment 23

•

9 years ago

Re-run jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1b8cb5d5af2 (5 times more for each job)

Julien Pagès (:parkouss)

Assignee

Comment 24

•

9 years ago

So far, looks pretty good! But graph server seems a bit broken. When I look at the tp failure for win8, I can see: > No test_name called 'tp5o' can be found > File "/var/www/html/graphs/server/pyfomatic/collect.py", line 293, in handleRequest > metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream) > File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__ > self.doDatabaseThings(databaseCursor) > File "/var/www/html/graphs/server/pyfomatic/collect.py", line 106, in doDatabaseThings > raise DatabaseException("No test_name called '%s' can be found" % self.test_name) (from http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/j.parkouss@gmail.com-e1b8cb5d5af2/try-win64/try_win8_64_test-tp5o-bm110-tests1-windows-build350.txt.gz) but this is only for one job, the other ones are good... Also, tp on linux64: > to determine geomean from 'test_run_values' for 50880468 - local variable 'values' referenced before assignment > File "/var/www/html/graphs/server/pyfomatic/collect.py", line 295, in handleRequest > geomean = valuesReader(databaseCursor, databaseModule, inputStream, metadata) > File "/var/www/html/graphs/server/pyfomatic/collect.py", line 238, in valuesReader > raise DatabaseException("to determine geomean from 'test_run_values' for %s - %s" % (metadata.test_run_id, str(x))) this time all builds failed this way. And this is all reported with the "talos.utils.TalosError: Graph server unreachable (5 attempts)" It really seems to me that these are issues from graphserver, looking at the log. Or do you think it is related to my changes in some way that I can't understand ?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 25

•

9 years ago

the real error is here: 03:05:32 INFO - No results collected for: tp5o_xres: when we upload no data for the counter (or a test), graph server throws an error. I am not sure why xres is causing troubles, could this be part of the counters work?

Julien Pagès (:parkouss)

Assignee

Comment 26

•

9 years ago

ah, nice catch. Yes, maybe. While working on the counters, I remember I was wondering if I should collect a last time after we stop collecting. https://bitbucket.org/parkouss/talos/commits/7fe8298e0aa272d8f241dee6b307ad35664fee62#Ltalos/cmanager.pyT68 I.E., do what we do in that while loop one more time after the loop end. Hmm, I guess we should try to reproduce first, then apply a patch for that, and try to not reproduce this time. Does that make sense ? Is this also the same kind of error for linux failures above ?

Julien Pagès (:parkouss)

Assignee

Comment 27

•

9 years ago

I re-triggered a bunch of tp jobs with that patch: https://bitbucket.org/parkouss/talos/commits/259a040bea20a0a95ef9e1ee00939b592b1a4422 No issues on Win8 - so this should do the trick. But linux is failing the same way (not intermittent, though). :jmaher, thoughts ?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 28

•

9 years ago

linux tp on inbound has data for xres, I suspect something you have changed is causing it to fail. I believe the counters are silent failures as they run in a different thread. let me look at this locally since I am working on counters :) If you have ideas for debugging it would be good to try that out.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 29

•

9 years ago

Comment on attachment 8647417 [details] [diff] [review] 1143727.patch Review of attachment 8647417 [details] [diff] [review]: ----------------------------------------------------------------- I wish I had some nits, lets fix the xrestop counter collection. I think we will be ok with default stuff, but what about e10s and pgo? The testing matrix is growing :)

Attachment #8647417 - Flags: review?(jmaher) → review+

Julien Pagès (:parkouss)

Assignee

Comment 30

•

9 years ago

Attached patch 1143727.patch — Details — Splinter Review

Bug found, fixed in this patch (thanks Joel!).

Attachment #8647417 - Attachment is obsolete: true

Attachment #8648090 - Flags: review+

Julien Pagès (:parkouss)

Assignee

Comment 31

•

9 years ago

Landed in https://hg.mozilla.org/build/talos/rev/55711f8341c5.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Updated

•

9 years ago

Blocks: 1195291

You need to log in before you can comment on or make changes to this bug.