Closed Bug 1156907 Opened 9 years ago Closed 8 years ago

tracking bug to migrate counters from graph server into treeherder

Categories: Testing :: Talos
Type: defect
Priority: Not set
Severity: normal

Tracking:
Status: RESOLVED FIXED
firefox40 --- affected

People:
(Reporter: jmaher, Assigned: jmaher)

Attachments:
(2 files, 2 obsolete files)

We have counters, and they all get uploaded quite nicely to graph server, but we currently don't have a way to display them in treeherder.

Graph server has a huge list of data points for each counter; those get summarized into a single data point and put on a graph.

For treeherder, we have the huge list of data points in the talos json blob, but these are stored in aux_data, which is not ingested.

Here is a list of counters we collect:
tp5o:
linux*: main_rss, private bytes, xres
osx10: main_rss, responsiveness
win: main_rss, private bytes, responsiveness, %cpu
win7: main_rss, private bytes, responsiveness, %cpu, modlistbytes
win8: responsiveness

xperf (win7 only):
tp5n_main_normal_fileio_paint, tp5n_main_normal_netio_paint
tp5n_main_startup_fileio_paint, tp5n_main_startup_netio_paint
tp5n_nonmain_normal_fileio_paint, tp5n_nonmain_normal_netio_paint
tp5n_nonmain_startup_fileio_paint


The next step is to look at the raw data these produce and decide whether we care about the replicates, or whether we can summarize the data into a few points and upload those instead of all the replicates.
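By "summarize into a few points" I mean something like a mean/max reduction over the replicates; a rough sketch (the sample values are made up for illustration):

# Sketch: reduce a counter's replicates to a few summary values.
# The sample Main_RSS values (bytes) are made up for illustration.
def summarize_counter(replicates):
    return {"max": "%.2f" % max(replicates),
            "mean": "%.2f" % (sum(replicates) / float(len(replicates)))}

print summarize_counter([187039885.0, 190112256.0, 239759360.0])
# mean 205637167.00, max 239759360.00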

One quirk is that treeherder will magically add these as subtests to the summary, which we want to avoid; it will take some thought.

Once we understand the data we can start adjusting it and fixing talos, graphserver, and perfherder.  We should also ensure this data is useful and assess how stable, reliable, and meaningful it is.
:aklotz, we have a list of counters above for xperf.  These are tracked, but we rarely alert on them for a regression.  We find many issues with the mainthreadIO stuff- rad!  What are your thoughts on these?  Should we keep them, toss them, or modify them?
Flags: needinfo?(aklotz)
I *think* I know what you're asking: I vote for keeping them for now.

Obviously for catching regressions we prefer that we just automatically go orange.

OTOH I think that keeping that data can be useful for ensuring that the test itself is doing what we want it to do.
Flags: needinfo?(aklotz)
Hey Will.  I am looking to make this a reality, and I have updated the output from talos to be more informative.

1) tp5 (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-beb62e8bce0d/try-win32/try_win7-ix_test-tp5o-bm111-tests1-windows-build232.txt.gz):
TALOSDATA: [{"talos_counters": {"responsiveness": {"max": "2565.56", "mean": "82.14"}, "Modified Page List Bytes": {"max": "108756992.00", "mean": "69332015.87"}, "% Processor Time": {"max": "100.00", "mean": "75.84"}, "Private Bytes": {"max": "222863360.00", "mean": "192576694.36"}, "Main_RSS": {"max": "239759360.00", "mean": "187039885.24"}}, "test_machine": {"platform": "x86", "osversion": "6.1.7601", "os": "win", "name": "T-W732-IX-061"}, "testrun": {"date": 1431722432, "suite": "tp5o", "options": {"responsiveness": true, "tpmozafterpaint": true, "tpchrome": true, "tppagecycles": 25, "tpcycles": 1, "tprender": false, "shutdown": false, "cycles": 1, "rss": true}}, "results": {"tudou.com": [491.0, 253.0, 246.0, 242.0, 251.0, 245.0, 245.0, 244.0, 244.0, 251.0, 245.0, 245.0, 253.0, 245.0, 251.0, 246.0, ...}}]

2) xperf (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-beb62e8bce0d/try-win32/try_win7-ix_test-xperf-bm111-tests1-windows-build181.txt.gz):
37929860.0}, "main_normal_netio": {"mean": 7136403.0}, "main_startup_netio": {"mean": 13.0}, "nonmain_startup_fileio": {"mean": 763172.0}, "nonmain_normal_fileio": {"mean": 528476482.0}, "main_normal_fileio": {"mean": 40102144.0}}, "test_machine": {"platform": "x86", "osversion": "6.1.7601", "os": "win", "name": "T-W732-IX-089"}, "testrun": {"date": 1431967004, "suite": "tp5n", "options": {"responsiveness": false, "tpmozafterpaint": true, "tpchrome": true, "tppagecycles": 1, "tpcycles": 1, "tprender": false, "shutdown": false, "cycles": 1, "rss": true}}, "results": {}, "test_build": {"version": "40.0a1", "revision": "beb62e8bce0d", "id": "20150506182320", "branch": "Try-Non-PGO", "name": "Firefox"}}]

In addition, I have added a printout of the counters' raw values to the raw log in case there is a need to look at detailed information.
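If anyone wants to pull the blob back out of a raw log to poke at it, a rough sketch (the log file name here is hypothetical):

import json

# Sketch: extract the TALOSDATA blob from a raw talos log.
# The log file name below is hypothetical.
PREFIX = "TALOSDATA: "
with open("try_win7-ix_test-tp5o.txt") as f:
    for line in f:
        if PREFIX in line:
            blob = json.loads(line.split(PREFIX, 1)[1])
            for testrun in blob:
                print testrun["testrun"]["suite"], sorted(testrun.get("talos_counters", {}).keys())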

Please let me know what you think.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8607146 - Flags: feedback?(wlachance)
Comment on attachment 8607146 [details] [diff] [review]
initial stab at fixing talosdata json to have talos_counters (0.9)

Review of attachment 8607146 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on irc, I'd prefer an approach that just adds a custom class to write the TALOSDATA structure to the logs, since we're going to stop submitting to DZ so soon.
Attachment #8607146 - Flags: feedback?(wlachance) → feedback-
* add counters to talosdata json blob in the log file
* print out counter raw data in the log
* removal of talos_aux and xperf_results from talosdata
* removal of datazilla from the output

NOTE: we still have oddities with datazilla-isms, especially as we still accept a datazilla-url.  Once this is deployed on all branches, we can tackle removing the mozharness bits that add datazilla-url, and then we can clean up perfconfig, etc.
Attachment #8607146 - Attachment is obsolete: true
Attachment #8608856 - Flags: review?(wlachance)
Comment on attachment 8608856 [details] [diff] [review]
remove datazilla object and add a perfherder object (1.0)

Review of attachment 8608856 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great! Would like to see one more iteration of this patch before landing, but it's quite close.

Do we still need to include the datazilla client libraries in talos? If not then I'd just as soon remove the dependencies now:

http://hg.mozilla.org/build/talos/file/3618f52d7d30/requirements.txt#l8
http://hg.mozilla.org/build/talos/file/3618f52d7d30/create_talos_zip.py#l37

Might be worth grepping for other instances of datazilla in the source, if it's not too much trouble. I don't want to hold up this bug on account of that, but if it only takes 5-10 minutes...

::: talos/output.py
@@ +209,5 @@
>  # (jhammel: we probably should do this in e.g. results.py vs in graphserver-specific code anyway)
>  
>                      # exclude counters whose values are tuples (bad for graphserver)
>                      if len(values) > 0 and isinstance(values[0], list):
> +                        print "---------------------------------"

Are these separators really necessary? I don't care that much either way, but if they're not hugely helpful I'd prefer to leave them out.

@@ +386,5 @@
>              linkName = '%s: %s' % (testname, result)
>              print 'RETURN: %s' % link_format % (url, linkName)
>  
>  
> +class PerfHerderOutput(Output):

PerfHerder -> Perfherder :)

@@ +393,3 @@
>  
>      def output(self, results, results_url, tbpl_output):
>          """output to the results_url

This comment now seems to be a lie. It looks like we output the result as a log message, and optionally also to a file.

@@ +400,5 @@
>          # parse the results url
>          results_url_split = utils.urlsplit(results_url)
>          results_scheme, results_server, results_path, _, _ = results_url_split
>  
> +        utils.info("TALOSDATA: %s" % json.dumps(results))

We should probably put a comment in here that this is what treeherder/perfherder actually parses.

@@ +408,1 @@
>              f.close()

Not that important, but while we're here, let's turn this into a one-liner:
json.dump(results, file(results_path, 'w'), indent=2, sort_keys=True)

@@ +409,4 @@
>  
>      def post(self, results, server, path, scheme, tbpl_output):
> +        """conform to current code- not needed for perfherder"""
> +        pass

Is there any way we can remove the call to this and take it out?

@@ +461,5 @@
> +        # build information
> +        browser_config = self.results.browser_config
> +
> +        # a place to put results
> +        retVal = []

can you name this something more descriptive, like test_results? Then you wouldn't need the comment. :)

@@ +464,5 @@
> +        # a place to put results
> +        retVal = []
> +
> +        for test in self.results.results:
> +            testVal = {'test_machine': {},

likewise, let's call this test_result

::: talos/ttest.py
@@ +638,5 @@
>  
>              # include global (cross-cycle) counters
>              test_results.all_counter_results.extend([{key: value} for key, value in global_counters.items()])
> +            print "Outputting all counters"
> +            print "-----------------------------------"

Again, I'd prefer to leave out the separators (and the "Outputting all ..." message) above if they're not genuinely helpful.
Attachment #8608856 - Flags: review?(wlachance) → review-
feedback addressed!

There are still a few instances of 'datazilla' in the code, specifically in perfconfigurator and things that use datazilla_urls.  I am going for majority wins here, knowing full well that there is other work involved in removing these params.
Attachment #8608856 - Attachment is obsolete: true
Attachment #8608929 - Flags: review?(wlachance)
Comment on attachment 8608929 [details] [diff] [review]
remove datazilla object and add a perfherder object (1.1)

Review of attachment 8608929 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, looks good! Haven't tested it of course. r+ assuming it still works as expected and with my last nit addressed.

::: talos/output.py
@@ +455,5 @@
> +
> +        # build information
> +        browser_config = self.results.browser_config
> +
> +        testResults = []

Nit: you should use test_results, test_result to be consistent with the rest of the surrounding code here
Attachment #8608929 - Flags: review?(wlachance) → review+
Blocks: 1167287
now for the perfherder side of the equation!
Attachment #8609395 - Flags: review?(wlachance)
See Also: → 1167646
Comment on attachment 8609395 [details] [review]
https://github.com/mozilla/treeherder/pull/558

Still needs a bit of work, but on the right track
Attachment #8609395 - Flags: review?(wlachance) → review-
Comment on attachment 8609395 [details] [review]
https://github.com/mozilla/treeherder/pull/558

updated the PR
Attachment #8609395 - Flags: review- → review?(wlachance)
Comment on attachment 8609395 [details] [review]
https://github.com/mozilla/treeherder/pull/558

Looks good except for a few remaining nits plus some minor problems (e.g. unrelated non-trivial changes which should be applied in another patch). r+ with those addressed.
Attachment #8609395 - Flags: review?(wlachance) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f6a5073e1a10ccf5b80cc31e6150893083240647
Bug 1156907 - migrate counters from graph server into treeherder

https://github.com/mozilla/treeherder/commit/e2fdb69da9fbb77b21ba9fe0dbd588b0a0ea7b47
Merge pull request #558 from jmaher/perfcounters

Bug 1156907 - migrate counters from graph server into treeherder
Addressed issues and merged.  Two of the main pieces are in place.

I want to:
1) verify talosdata is what we expect from talos
2) see this show up as valid counters in production treeherder
3) validate (this Thursday) the counters and what we care about calculating
I have verified that what we are printing on trunk is indeed what we expect, so #1 is done.

#2 and #3 remain.

For #3 I have a series of graphs to display the raw data from individual runs (i.e. not graphs over time, but graphs for the raw counter replicates collected during a single run):
http://people.mozilla.org/~jmaher/counters/counters.html

These graphs take on odd shapes; I would like to start figuring out what is useful to measure so we can track it over time and file regressions on it.  What we have so far with the average value is working ok.  For each counter we should either:
* leave it alone
* disable it
* come up with a better calculation (see the sketch below)
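As a starting point for "a better calculation", a rough sketch comparing candidate summaries over one run's replicates (the values below are made up, loosely modeled on the responsiveness numbers above, where one outlier dominates):

# Sketch: compare candidate summaries over one run's counter replicates.
# Replicate values are made up; note how a single outlier moves the mean
# and max but barely moves the median.
def candidate_summaries(replicates):
    ordered = sorted(replicates)
    n = len(ordered)
    return {"mean": sum(ordered) / float(n),      # what we alert on today
            "max": ordered[-1],
            "median": ordered[n // 2],
            "p95": ordered[int((n - 1) * 0.95)]}  # crude percentile

print candidate_summaries([82.1, 75.3, 90.2, 2565.6, 79.9, 81.0])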

Vladan, in our perftesting meeting today you asked to be included on a bug and have more information about it.  Let's use this as a starting point.
Flags: needinfo?(vdjeric)
Depends on: 1170241
Ok, I took a look. The memory and process time counters look meaningful; responsiveness is the only one that looks suspicious to me. Am I misinterpreting it?
Flags: needinfo?(vdjeric)
Vladan, thanks for taking some time to look at this. Yes, responsiveness is the only test which is different; it is outlined in more detail in bug 1170241.

As it stands, we are calculating the mean and max values of the set of results going into treeherder.  This is in line with what we have been doing for graph server.  The alerting takes place on the mean value.
Depends on: 1174724
counters are in perfherder now.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED