Closed Bug 1193385 Opened 9 years ago Closed 9 years ago

Test summarization for subtests gives incorrect results for tresize (and possibly other tests)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: wlach, Assigned: jmaher)

References

Details

Attachments

(2 files, 2 obsolete files)

when we have --cycles >1, roll subtests together (1.0) 9 years ago Joel Maher ( :jmaher ) (UTC -8) 3.66 KB, patch	parkouss : feedback+	Details \| Diff \| Splinter Review
when we have --cycles >1, roll subtests together (1.1) 9 years ago Joel Maher ( :jmaher ) (UTC -8) 4.78 KB, patch	parkouss : review-	Details \| Diff \| Splinter Review
fix.patch 9 years ago Julien Pagès (:parkouss) 1.81 KB, patch		Details \| Diff \| Splinter Review
when we have --cycles >1, roll subtests together (1.2) 9 years ago Joel Maher ( :jmaher ) (UTC -8) 4.73 KB, patch	parkouss : review+	Details \| Diff \| Splinter Review

William Lachance (:wlach)

Reporter

Description

•

9 years ago

See here for example: https://treeherder.mozilla.org/logviewer.html#?job_id=12743279&repo=mozilla-inbound http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439303977/mozilla-inbound_ubuntu64_hw_test-chromez-bm103-tests1-linux-build3881.txt.gz Despite having these replicates: 17.76164999999998, 20.503433333333373, 21.181566666666647, 18.56949999999996, 16.985500000000034, 17.831733333333343, 21.00719999999998, 18.389749999999985, 18.190666666666623, 17.328116666666666, 17.031466666666667, 19.077733333333324, 17.125199999999996, 16.616666666666674, 16.775350000000003, 20.67286666666666, 18.896600000000017, 18.486100000000025, 19.525000000000016, 19.44226666666668 We have these results for min/max/std/med/mean: {"std": 0.0, "min": 19.44226666666668, "max": 19.44226666666668, "median": 19.44226666666668, "filtered": 19.44226666666668, "mean": 19.44226666666668} Something is clearly off

Avi Halachmi (:avih)

Comment 1

•

9 years ago

From Joel on IRC: this bug probably applies to all startup tests.

William Lachance (:wlach)

Reporter

Updated

•

9 years ago

Summary: Test summarization for subtests for tresize (at least) → Test summarization for subtests gives incorrect results for tresize (and possibly other tests)

Avi Halachmi (:avih)

Updated

•

9 years ago

Blocks: 1194333

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 2

•

9 years ago

Attached patch when we have --cycles >1, roll subtests together (1.0) (obsolete) — Details — Splinter Review

this is really an issue of --cycles >1, but we only do that for startup_tests, and only have a problem (I believe so) with tresize,ts_paint,sessionrestore*. These tests all have a pagename NULL. This is a hack, I would like to have the tests spit out a common format and we can just simplify our parsing overall. This is where I am, maybe some other eyes can help shed light on a more elegant solution.

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #8648066 - Flags: feedback?(j.parkouss)

Julien Pagès (:parkouss)

Comment 3

•

9 years ago

Comment on attachment 8648066 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.0) Review of attachment 8648066 [details] [diff] [review]: ----------------------------------------------------------------- Looks good to me, some minor style preferences that I find more intuitive. This seems to work, I have an output like that with tresize and cycles==2: {"std": 0.19869552666664703, "min": 12.417350793333359, "max": 12.814741846666653, "median": 12.616046320000006, "filtered": 12.616046320000006, "mean": 12.616046320000006} Still we should try later to rework the results/output (maybe after graphserver end of life) to make summarization (and in a more general way result construction and printing) code looks better. ::: talos/output.py @@ +10,5 @@ > import logging > import post_file > import time > import utils > +import results as TalosResults Note that you can now use talos package. :) I think it would be good to import it with: from talos.results import Results since you only use that from the results module. @@ +494,5 @@ > test_result['testrun']['date'] = self.results.date > > # serialize test results > results = {} > + tsresults = [] It seems to me that you can assign that to None, and just call it tsresult since later you only append if the list is not empty (one element max). @@ +499,5 @@ > summary = {"suite": 0, "subtests": {}} > if not test.using_xperf: > vals = [] > + > + #TODO: counters!!!! we don't have any, but they suffer the same nit: space after "#TODO" - e.g., "# TODO" (there is another one after). @@ +509,5 @@ > + results.setdefault(test.name(), []).extend(val) > + if not tsresults: > + r = TalosResults.Results() > + r.results = [{'index': 0, 'page': test.name(), 'runs': val}] > + tsresults.append(r) only define tsresult if it is None here @@ +512,5 @@ > + r.results = [{'index': 0, 'page': test.name(), 'runs': val}] > + tsresults.append(r) > + else: > + #TODO: do we need such indirection > + for res in tsresults: same thing, only test for tsresult to be not None. @@ +513,5 @@ > + tsresults.append(r) > + else: > + #TODO: do we need such indirection > + for res in tsresults: > + for r in res.results: I think you can just use tsresult.results[0] since you add only one value above @@ +521,5 @@ > + results.setdefault(page, []).extend(val) > + > + tresults = test.results > + if tsresults: > + tresults = tsresults you can write a one line if you like, e.g.: tresults = [tsresult] if tresult else tests.results

Attachment #8648066 - Flags: feedback?(j.parkouss) → feedback+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 4

•

9 years ago

hmm, we seem to have circular dependencies with: from [talos.]results import Results in results, we have: from talos import output and we use output.filters. could just move the filters to results, have to think on this a bit.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 5

•

9 years ago

Attached patch when we have --cycles >1, roll subtests together (1.1) (obsolete) — Details — Splinter Review

updated patch with review feedback. Including in this patch is the actual ability to use the new fancy v8 stuff since the filter was never defined. This should cleanup all our existing needs for talos summarization from the harness.

Attachment #8648066 - Attachment is obsolete: true

Attachment #8648470 - Flags: review?(j.parkouss)

Julien Pagès (:parkouss)

Comment 6

•

9 years ago

Comment on attachment 8648470 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.1) Review of attachment 8648470 [details] [diff] [review]: ----------------------------------------------------------------- Unfortunately, it fails: talos --develop -a tresize -e /usr/bin/firefox --cycles 2 ... Traceback (most recent call last): File "/home/jp/dev/talos/venv/bin/talos", line 9, in <module> load_entry_point('talos==0.0', 'console_scripts', 'talos')() File "/home/jp/dev/talos/talos/run_tests.py", line 284, in main sys.exit(run_tests(config, browser_config)) File "/home/jp/dev/talos/talos/run_tests.py", line 267, in run_tests talos_results.output(results_urls) File "/home/jp/dev/talos/talos/results.py", line 69, in output results = _output() File "/home/jp/dev/talos/talos/output.py", line 523, in __call__ result.values(test_result['testrun']['suite'], AttributeError: 'list' object has no attribute 'values' Also, I tried to fix the patch quickly (see the attachment), this seems to work for tresize (same with the previous patch) but with v8_7: talos --develop -a v8_7 -e /usr/bin/firefox --cycles 3 ... {"std": 0.0, "min": 795.5449482895783, "max": 795.5449482895783, "median": 795.5449482895783, "value": 930.1661730000001, "filtered": 930.1661730000001, "mean": 795.5449482895783} that seems not right (not sure if this patch is intended to address that or not though)

Attachment #8648470 - Flags: review?(j.parkouss) → review-

Julien Pagès (:parkouss)

Comment 7

•

9 years ago

Attached patch fix.patch — Details — Splinter Review

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 8

•

9 years ago

Attached patch when we have --cycles >1, roll subtests together (1.2) — Details — Splinter Review

thanks for checking this out- I think I got hurried. I have attached a patch that is the original + the fix. I have retriggered on try all the jobs and looked at the summary section of the talosdata.

Attachment #8648470 - Attachment is obsolete: true

Attachment #8648663 - Flags: review?(j.parkouss)

Julien Pagès (:parkouss)

Comment 9

•

9 years ago

Comment on attachment 8648663 [details] [diff] [review] when we have --cycles >1, roll subtests together (1.2) Review of attachment 8648663 [details] [diff] [review]: ----------------------------------------------------------------- Yup, np. :) Looks good to me (not tested though), just one small nit. ::: talos/output.py @@ +491,5 @@ > test_result['testrun']['date'] = self.results.date > > # serialize test results > results = {} > + tsresults = None Nit: I would name this tsresult (without trailing 's') since this hold only one test result instance. This will make more sense for the one line construction: tresults = [tsresult] if tsresult else test.results

Attachment #8648663 - Flags: review?(j.parkouss) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 10

•

9 years ago

http://hg.mozilla.org/build/talos/rev/c0039de7a1c6

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

9 years ago

Blocks: 1195291

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

9 years ago

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

William Lachance (:wlach)

Reporter

Comment 11

•

9 years ago

Did some verification, looks good for tresize and tspaint at least. Compare these two revisions, note how min/max didn't work before (were the same as mean), and do now: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,726a741fd1dd1e9fb0b85bc9733658f6eabd07df,1]&highlightedRevisions=f106756af2a4&highlightedRevisions=f195230a7c5d

Avi Halachmi (:avih)

Comment 12

•

9 years ago

(In reply to William Lachance (:wlach) from comment #11) > Did some verification, looks good for tresize and tspaint at least. Compare > these two revisions, note how min/max didn't work before (were the same as > mean), and do now: > > https://treeherder.mozilla.org/perf.html#/ > graphs?timerange=604800&series=[mozilla-inbound, > 726a741fd1dd1e9fb0b85bc9733658f6eabd07df, > 1]&highlightedRevisions=f106756af2a4&highlightedRevisions=f195230a7c5d Would this be a correct description of the issue, taking your ts_paint example? : The TALOSDATA JSON blob (which treeherder uses as the main input from the raw log) had the correct (cycle) replicates both before and after the change for ts_paint, but the summary at the same JSOM blob for ts_paint used only the last replicate value for min/max/mean/etc ? (looking at the raw logs, before: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439391127/mozilla-inbound_ubuntu64_hw_test-other_l64-bm103-tests1-linux-build3821.txt.gz and after: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1439923834/mozilla-inbound_ubuntu64_hw_test-other_l64-bm104-tests1-linux-build1955.txt.gz ) Question: did it also affect graphserver? or are there a different talos code paths for cycle replicate summarization between outputs designated to graphserver and treeherder?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 13

•

9 years ago

that is a correct statement of using only the last cycle. This did not affect graphserver at all as this is just the data inside the talosdata json blob.

Avi Halachmi (:avih)

Comment 14

•

9 years ago

Thanks. So, according to 'test.py', the following tests were affected since they have more than one cycle: - ts_paint (20) - sessionrestore (10) - tresize (20) - media_test (5) - tps (5) tcanvasmark has 5 "tpcycles" rather than "cycles" (not to be confused with tppagecycles). Was it affected (and fixed) as well?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 15

•

9 years ago

When I looked at the data tcanvasmark was not affected, keep in mind sessionrestore_no_auto_restore would be affected, and we don't run media_tests anymore.

You need to log in before you can comment on or make changes to this bug.