Closed Bug 1193385 Opened 9 years ago Closed 9 years ago

Test summarization for subtests gives incorrect results for tresize (and possibly other tests)

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: jmaher)

References

Details

Attachments

(2 files, 2 obsolete files)

See here for example: https://treeherder.mozilla.org/logviewer.html#?job_id=12743279&repo=mozilla-inbound http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439303977/mozilla-inbound_ubuntu64_hw_test-chromez-bm103-tests1-linux-build3881.txt.gz Despite having these replicates: 17.76164999999998, 20.503433333333373, 21.181566666666647, 18.56949999999996, 16.985500000000034, 17.831733333333343, 21.00719999999998, 18.389749999999985, 18.190666666666623, 17.328116666666666, 17.031466666666667, 19.077733333333324, 17.125199999999996, 16.616666666666674, 16.775350000000003, 20.67286666666666, 18.896600000000017, 18.486100000000025, 19.525000000000016, 19.44226666666668 We have these results for min/max/std/med/mean: {"std": 0.0, "min": 19.44226666666668, "max": 19.44226666666668, "median": 19.44226666666668, "filtered": 19.44226666666668, "mean": 19.44226666666668} Something is clearly off
From Joel on IRC: this bug probably applies to all startup tests.
Summary: Test summarization for subtests for tresize (at least) → Test summarization for subtests gives incorrect results for tresize (and possibly other tests)
Blocks: 1194333
this is really an issue of --cycles >1, but we only do that for startup_tests, and only have a problem (I believe so) with tresize,ts_paint,sessionrestore*. These tests all have a pagename NULL. This is a hack, I would like to have the tests spit out a common format and we can just simplify our parsing overall. This is where I am, maybe some other eyes can help shed light on a more elegant solution.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8648066 - Flags: feedback?(j.parkouss)
Comment on attachment 8648066 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.0) Review of attachment 8648066 [details] [diff] [review]: ----------------------------------------------------------------- Looks good to me, some minor style preferences that I find more intuitive. This seems to work, I have an output like that with tresize and cycles==2: {"std": 0.19869552666664703, "min": 12.417350793333359, "max": 12.814741846666653, "median": 12.616046320000006, "filtered": 12.616046320000006, "mean": 12.616046320000006} Still we should try later to rework the results/output (maybe after graphserver end of life) to make summarization (and in a more general way result construction and printing) code looks better. ::: talos/output.py @@ +10,5 @@ > import logging > import post_file > import time > import utils > +import results as TalosResults Note that you can now use talos package. :) I think it would be good to import it with: from talos.results import Results since you only use that from the results module. @@ +494,5 @@ > test_result['testrun']['date'] = self.results.date > > # serialize test results > results = {} > + tsresults = [] It seems to me that you can assign that to None, and just call it tsresult since later you only append if the list is not empty (one element max). @@ +499,5 @@ > summary = {"suite": 0, "subtests": {}} > if not test.using_xperf: > vals = [] > + > + #TODO: counters!!!! we don't have any, but they suffer the same nit: space after "#TODO" - e.g., "# TODO" (there is another one after). @@ +509,5 @@ > + results.setdefault(test.name(), []).extend(val) > + if not tsresults: > + r = TalosResults.Results() > + r.results = [{'index': 0, 'page': test.name(), 'runs': val}] > + tsresults.append(r) only define tsresult if it is None here @@ +512,5 @@ > + r.results = [{'index': 0, 'page': test.name(), 'runs': val}] > + tsresults.append(r) > + else: > + #TODO: do we need such indirection > + for res in tsresults: same thing, only test for tsresult to be not None. @@ +513,5 @@ > + tsresults.append(r) > + else: > + #TODO: do we need such indirection > + for res in tsresults: > + for r in res.results: I think you can just use tsresult.results[0] since you add only one value above @@ +521,5 @@ > + results.setdefault(page, []).extend(val) > + > + tresults = test.results > + if tsresults: > + tresults = tsresults you can write a one line if you like, e.g.: tresults = [tsresult] if tresult else tests.results
Attachment #8648066 - Flags: feedback?(j.parkouss) → feedback+
hmm, we seem to have circular dependencies with: from [talos.]results import Results in results, we have: from talos import output and we use output.filters. could just move the filters to results, have to think on this a bit.
updated patch with review feedback. Including in this patch is the actual ability to use the new fancy v8 stuff since the filter was never defined. This should cleanup all our existing needs for talos summarization from the harness.
Attachment #8648066 - Attachment is obsolete: true
Attachment #8648470 - Flags: review?(j.parkouss)
Comment on attachment 8648470 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.1) Review of attachment 8648470 [details] [diff] [review]: ----------------------------------------------------------------- Unfortunately, it fails: talos --develop -a tresize -e /usr/bin/firefox --cycles 2 ... Traceback (most recent call last): File "/home/jp/dev/talos/venv/bin/talos", line 9, in <module> load_entry_point('talos==0.0', 'console_scripts', 'talos')() File "/home/jp/dev/talos/talos/run_tests.py", line 284, in main sys.exit(run_tests(config, browser_config)) File "/home/jp/dev/talos/talos/run_tests.py", line 267, in run_tests talos_results.output(results_urls) File "/home/jp/dev/talos/talos/results.py", line 69, in output results = _output() File "/home/jp/dev/talos/talos/output.py", line 523, in __call__ result.values(test_result['testrun']['suite'], AttributeError: 'list' object has no attribute 'values' Also, I tried to fix the patch quickly (see the attachment), this seems to work for tresize (same with the previous patch) but with v8_7: talos --develop -a v8_7 -e /usr/bin/firefox --cycles 3 ... {"std": 0.0, "min": 795.5449482895783, "max": 795.5449482895783, "median": 795.5449482895783, "value": 930.1661730000001, "filtered": 930.1661730000001, "mean": 795.5449482895783} that seems not right (not sure if this patch is intended to address that or not though)
Attachment #8648470 - Flags: review?(j.parkouss) → review-
thanks for checking this out- I think I got hurried. I have attached a patch that is the original + the fix. I have retriggered on try all the jobs and looked at the summary section of the talosdata.
Attachment #8648470 - Attachment is obsolete: true
Attachment #8648663 - Flags: review?(j.parkouss)
Comment on attachment 8648663 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.2) Review of attachment 8648663 [details] [diff] [review]: ----------------------------------------------------------------- Yup, np. :) Looks good to me (not tested though), just one small nit. ::: talos/output.py @@ +491,5 @@ > test_result['testrun']['date'] = self.results.date > > # serialize test results > results = {} > + tsresults = None Nit: I would name this tsresult (without trailing 's') since this hold only one test result instance. This will make more sense for the one line construction: tresults = [tsresult] if tsresult else test.results
Attachment #8648663 - Flags: review?(j.parkouss) → review+
Blocks: 1195291
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Did some verification, looks good for tresize and tspaint at least. Compare these two revisions, note how min/max didn't work before (were the same as mean), and do now: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,726a741fd1dd1e9fb0b85bc9733658f6eabd07df,1]&highlightedRevisions=f106756af2a4&highlightedRevisions=f195230a7c5d
(In reply to William Lachance (:wlach) from comment #11) > Did some verification, looks good for tresize and tspaint at least. Compare > these two revisions, note how min/max didn't work before (were the same as > mean), and do now: > > https://treeherder.mozilla.org/perf.html#/ > graphs?timerange=604800&series=[mozilla-inbound, > 726a741fd1dd1e9fb0b85bc9733658f6eabd07df, > 1]&highlightedRevisions=f106756af2a4&highlightedRevisions=f195230a7c5d Would this be a correct description of the issue, taking your ts_paint example? : The TALOSDATA JSON blob (which treeherder uses as the main input from the raw log) had the correct (cycle) replicates both before and after the change for ts_paint, but the summary at the same JSOM blob for ts_paint used only the last replicate value for min/max/mean/etc ? (looking at the raw logs, before: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439391127/mozilla-inbound_ubuntu64_hw_test-other_l64-bm103-tests1-linux-build3821.txt.gz and after: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439923834/mozilla-inbound_ubuntu64_hw_test-other_l64-bm104-tests1-linux-build1955.txt.gz ) Question: did it also affect graphserver? or are there a different talos code paths for cycle replicate summarization between outputs designated to graphserver and treeherder?
that is a correct statement of using only the last cycle. This did not affect graphserver at all as this is just the data inside the talosdata json blob.
Thanks. So, according to 'test.py', the following tests were affected since they have more than one cycle: - ts_paint (20) - sessionrestore (10) - tresize (20) - media_test (5) - tps (5) tcanvasmark has 5 "tpcycles" rather than "cycles" (not to be confused with tppagecycles). Was it affected (and fixed) as well?
When I looked at the data tcanvasmark was not affected, keep in mind sessionrestore_no_auto_restore would be affected, and we don't run media_tests anymore.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: