Closed Bug 1204625 Opened 4 years ago Closed 4 years ago

Add metadata on tests/summaries/counters for perfherder

Categories

(Testing :: Talos, defect)

defect
Not set

Tracking

(firefox45 fixed, b2g-v2.5 fixed)

RESOLVED FIXED
mozilla45
Tracking Status
firefox45 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: wlach, Assigned: wlach)

References

Details

Attachments

(4 files, 3 obsolete files)

For bug 1158233, we should add some metadata to TALOSDATA about each test that we're running, to indicate both the units of measurement and whether lower/higher is better per test. I propose adding the following fields to each summary:

"units": some textual thing, like "MB", "Bytes", or "msec"
"lowerIsBetter": boolean, true or false

:MikeLing, would you be interested in taking this on? You'll get to learn a bit about Talos, which is a pretty interesting project. :jmaher has offered to help mentor.
mikeling, if you are interested, lets get started with a local build of firefox:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions
an example area of code to look at would be similar to the patch in bug 1200716.
Assignee: nobody → sabergeass
(In reply to Joel Maher (:jmaher) from comment #1)
> mikeling, if you are interested, lets get started with a local build of
> firefox:
> https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/
> Build_Instructions

Hi Joel, the local Firefox build has ready, due to i'm also trying to get involved into SeaMonkey: )
Hi Joel,

I had build firefox locally, and check the PR of bug 1200716. But it's weird, those code been address in bug 1200716 still in my codebase even after I use 'hg pull -u' to update it. For example, in talos/results.py, there are something like:
            data_summary = {
                'min': min(data),
                'max': max(data),
                'mean': filter.mean(data),
                'median': filter.median(data),
                'std': filter.stddev(data)
}

BTW, do I checkout the right place? Are there any files else I should pay attention to for this bug? How could I test it after I finish it? Please give me some hit so I can move on, thank you!
Bug 1200716 should be in the code:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/output.py?offset=0#518

What is the repository name that you work with (I look at the default value in .hg/hgrc)?  Is it http://hg.mozilla.org/mozilla-central?  There are other useful ones to work from as well, but that might be the problem.  What you are doing with 'hg pull -u' is the way to update it.
(In reply to Joel Maher (:jmaher) from comment #5)
> What is the repository name that you work with (I look at the default value
> in .hg/hgrc)?  Is it http://hg.mozilla.org/mozilla-central?  There are other
> useful ones to work from as well, but that might be the problem.  What you
> are doing with 'hg pull -u' is the way to update it.

Oh, I see where the problem is! I use the repository here http://hg.mozilla.org/build/talos and the related document is https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code.
Hi Joel,

I got right talos database now and local Firefox building. How could I test the result after add something to talos database? thank you: )
MikeLing, great question.

When you run talos locally with --develop, it will create a results.json file, in there you will see the raw output and there you will be able to find any additions to the data.

Once that is done, we would need to land it, then work on the treeherder ingestion.  We can test it out on try server as well, when you have a patch ready, I can help you push to try server.
(In reply to Joel Maher (:jmaher) from comment #8)

Thank you Joel! But I still got question about this issue,

I assume we should change this line:https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/output.py?offset=0#490 to summary = {"suite": 0, "subtests": {}, "units": "", "lowerIsBetter": true[ or flase]}. But how could I collect data to tell whether higher or lower numbers are better
 and find unit of measurement (e.g. 'MB') for memory consumption tests?
MikeLing,

Great question- I am thinking we could annotate lowerIsBetter in test.py:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/test.py#513

we could add an attribute to the test indicating lowerIsBetter which defaults to False.


As for the units of measurement it would be nice to put that in test.py as well, but most of the questions come related to counters which are loosely defined in test.py.

We could either make the counters a dict instead of array:
 win_counters = ['Main_RSS', 'Private Bytes', '% Processor Time']
would be:
 win_counters = {'Main_RSS': 'MB', 'Private Bytes'; 'MB', '% Processor Time': 'Time'}

then we would have to fix all instances of the counters to accept this annotations- that seems like the right approach for making things less confusing.

Currently we sort of have this support in output.py with:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/output.py#91
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/output.py#21

I would advocate using that for now and we can determine the proper method for annotating the measurement type.  :parkouss, what do you prefer here?
Flags: needinfo?(j.parkouss)
So, I don't know well that part of the code. I think we have two distinct cases:

- test
- counters

So, for the test case, changes needs to be done where MikeLing mentioned: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/output.py?offset=0#490.

I agree whith you Joel to add "lowerIsBetter" on Test classes (in talos/test.py file) - and also "unit", so TALOS_DATA result:

>   "summary": {
>      "subtests": {
>        "tresize": {
>          "filtered": 13.045075, 
>          "value": 13.045075
>        }
>      }, 
>      "suite": 13.045075
>    },

should then looks like:

>   "summary": {
>      "subtests": {
>        "tresize": {
>          "filtered": 13.045075, 
>          "value": 13.045075,
>          "unit": '...',
>          "lowerIsBetter": ...
>        }
>      }, 
>      "suite": 13.045075
>    },

The counters will be a bit trickier. Basically, I would prefer to not have to write 'MB' for each declaration of 'Main_RSS', to avoid possible mistake - it would be best to only say 'I want Main_RSS' from a test point of view. Because Main_RSS will always be in MB, this is not test-dependent. But this may require a counter refactoring; I will look more into this, to see if we can do better.

That being said, I'm ok with your proposal Joel, we can do better later (counters needs refactoring anyway!).

Basically, Mikeling, I would say you can start with the work for 'test'. Separating patches for test / counters seems good to me here.
Flags: needinfo?(j.parkouss)
thanks :parkouss for the great advice, lets move forward with what we have outlined here- counter refactoring can take place in the future.
Attached patch tests.patch (obsolete) — Splinter Review
Hi Joel and parkouss,

I'm not sure if this patch is fit your mind, because the value "lowserIsBetter" and "unit" are almost been hard code inside the result for now. But I think we need to calculate them in "counter" part or so, please give some hit then I can move on. Thank you :)
Attachment #8672610 - Flags: feedback?(jmaher)
Attachment #8672610 - Flags: feedback?(j.parkouss)
Comment on attachment 8672610 [details] [diff] [review]
tests.patch

Review of attachment 8672610 [details] [diff] [review]:
-----------------------------------------------------------------

One comment, but otherwise looks good to me!

I did not tested though, but assuming it works and that tcanvasmark is the only test where we want to add that for now (:jmaher or :wlach probably know about that) then this is great to me. :)

::: testing/talos/talos/output.py
@@ +519,5 @@
>                      for val, page in filtered_results:
> +                        measurement = {}
> +                        # Because the value of 'lowerIsBetter' is False or True, so we only test
> +                        # if the 'unit' is exit.
> +                        if test.name() == 'tcanvasmark':

hm, if I understand the command, should'nt we have:

if test.test_config.get('unit'):

instead of "if test.name() == 'tcanvasmark':" ?

BTW the comment does not seems good to me (only from an english point of view). Maybe you could remove that comment and explicitly check for the None value (meaning undefined here), something like:

if (test.test_config.get('unit') is not None and
    test.test_config.get('lowerIsBetter') is not None):
Attachment #8672610 - Flags: feedback?(j.parkouss) → feedback+
(In reply to Julien Pagès (:parkouss) from comment #14)

yep, you're right. The comment need to be fixed because it's totally unrelated with code. I just forget to remove that after make changes. Thank you for your help!
Comment on attachment 8672610 [details] [diff] [review]
tests.patch

Review of attachment 8672610 [details] [diff] [review]:
-----------------------------------------------------------------

not 100% correct, but I think a great patch for your first Talos patch!

::: testing/talos/talos/output.py
@@ +521,5 @@
> +                        # Because the value of 'lowerIsBetter' is False or True, so we only test
> +                        # if the 'unit' is exit.
> +                        if test.name() == 'tcanvasmark':
> +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
> +                                           'unit': test.test_config['unit']}

keying off of test.name() isn't ideal.  Can we somehow ensure that lowerIsBetter is true all the time, then always include it in the summary?

As for unit, we will need to support this in the counters, that is where it is most important.  I would like to default unit to 'time' or '' (depending on wlach) and then set it if it is different in the test (as you have done), or in the counters with other logic.

::: testing/talos/talos/test.py
@@ +524,5 @@
>      tpmozafterpaint = False
>      preferences = {'dom.send_after_paint_to_content': False}
>      filters = filter.ignore_first.prepare(1) + filter.median.prepare()
> +    lowerIsBetter = False
> +    unit = 'MB'

unit= 'MB' will be for counters only.

As for lowerIsBetter, this needs to be on v8_7, dromaeo_dom, dromaeo_css as well.
Attachment #8672610 - Flags: feedback?(jmaher)
Attachment #8672610 - Flags: feedback?
Attachment #8672610 - Flags: feedback-
Attachment #8672610 - Flags: feedback+
Attached patch tests.patch (obsolete) — Splinter Review
I'm not sure if it's ok to go, I will paste the output json file also.
Attachment #8672610 - Attachment is obsolete: true
Attachment #8673622 - Flags: review?(jmaher)
Attached file local.json
Attachment #8673630 - Flags: feedback?(jmaher)
Comment on attachment 8673630 [details]
local.json

this looks great.  I would like to ensure we have this at the summary level as well.  All the subtests have it.  Can we add in unit/lowerisbetter for the top level summary?
Attachment #8673630 - Flags: feedback?(jmaher) → feedback+
Comment on attachment 8673622 [details] [diff] [review]
tests.patch

Review of attachment 8673622 [details] [diff] [review]:
-----------------------------------------------------------------

do we know that lowerIsBetter is true all the time except when specified as False?  I would like to see an example of that when updating this patch!

This looks much better than the first patch, great job of working on this.

::: testing/talos/talos/output.py
@@ +523,5 @@
> +                        if test.test_config.get('unit') is not None:
> +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
> +                                           'unit': test.test_config['unit']}
> +                        else:
> +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter']}

I would restructure this:
unit = test.test_config['unit']}
if not unit:
    unit = ""
measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
               'unit': test.test_config['unit']}

@@ +534,5 @@
>  
>                  suite_summary = self.construct_results(vals,
>                                                         testname=test.name())
>                  summary['suite'] = suite_summary
>                  test_result['summary'] = summary

Add something here in summary to include lowerIsBetter and unit.  lowerIsBetter is equal across subtests, but unit might be different- I think we can keep it the same for now and add a comment to indicate the assumption.
Attachment #8673622 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #20)
> ::: testing/talos/talos/output.py
> @@ +523,5 @@
> > +                        if test.test_config.get('unit') is not None:
> > +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
> > +                                           'unit': test.test_config['unit']}
> > +                        else:
> > +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter']}
> 
> I would restructure this:
> unit = test.test_config['unit']}
> if not unit:
>     unit = ""
> measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
>                'unit': test.test_config['unit']}

IMO we shouldn't print an empty string for units. If we don't have a value for the "unit", just omit it entirely from the output.
(In reply to Joel Maher (:jmaher) from comment #20)
> Comment on attachment 8673622 [details] [diff] [review]
> tests.patch
> 
> Review of attachment 8673622 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> do we know that lowerIsBetter is true all the time except when specified as
> False?  I would like to see an example of that when updating this patch!

Actually, I don't fully understand about meaning about 'lowerIsBetter is true all the time'. We have set lowserIsBetter as False but don't test its value and always include it into 'summary'.
After discussing on irc we need to ensure that perfherder can deal with json that doesn't have these two fields present.  That means that we assume in perfherder that:
lowerIsBetter=True
unit="time"

If either of the two new fields are present, we will use those values instead of the defaults.  So it is ok to not have an example of other tests showing the field lowerIsBetter=true.
Attached patch tests.patchSplinter Review
Attachment #8673622 - Attachment is obsolete: true
Attachment #8676090 - Flags: feedback?(jmaher)
Comment on attachment 8676090 [details] [diff] [review]
tests.patch

Review of attachment 8676090 [details] [diff] [review]:
-----------------------------------------------------------------

I think the unit stuff is mostly important for counters which are not defined in test.py.  I do like the usage of 'ms' and 'fps', if we want to go that route, then we should look in detail at each test.  Some tests like tart, cart, and damp measure a lot of different things- some 'ms', and some 'error'.  I would like to call those 'ms' tests, but that would be slightly inaccurate.  I do think we should move forward with this- maybe wlach has some ideas on the units to use.

What would be more important would be making sure the summary (not the subtest) has the betterIsLower and unit.  Overall this is a great start.

::: testing/talos/talos/output.py
@@ +523,5 @@
> +                        if test.test_config.get('unit'):
> +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter'],
> +                                           'unit': test.test_config['unit']}
> +                        else:
> +                            measurement = {'lowerIsBetter': test.test_config['lowerIsBetter']}

I would still prefer:
measurement = {'lowerIsBetter': test.test_config['lowerIsBetter']}
if test.test_config.get('unit'):
  measurement['unit'] = test.test_config['unit']
Attachment #8676090 - Flags: feedback?(wlachance)
Attachment #8676090 - Flags: feedback?(jmaher)
Attachment #8676090 - Flags: feedback+
Comment on attachment 8676090 [details] [diff] [review]
tests.patch

Review of attachment 8676090 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me. I agree we might want to specify units for more tests (of course it's ok to also omit them if the test doesn't have them)

::: testing/talos/talos/test.py
@@ +640,5 @@
>      preferences = {'layout.frame_rate': 0,
>                     'docshell.event_starvation_delay_hint': 1,
>                     'dom.send_after_paint_to_content': False}
>      filters = filter.ignore_first.prepare(5) + filter.median.prepare()
> +    unit = '1/FPS'

I think it would be ok to use a more detailed set of units here, like "average frame interval (1/fps)"
Attachment #8676090 - Flags: feedback?(wlachance) → feedback+
Comment on attachment 8676090 [details] [diff] [review]
tests.patch

Review of attachment 8676090 [details] [diff] [review]:
-----------------------------------------------------------------

lets get this landed so we can move forward.
Attachment #8676090 - Flags: review+
Attached patch Updated patch (obsolete) — Splinter Review
I want this to land sooner than later, so I took the liberty of taking MikeLing's patch, unbitrotting it, and changing the comment. Credit where it is due in the commit message. :)

Try run: 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c953d29c8e47
Assignee: sabergeass → wlachance
Attachment #8681440 - Flags: review?(jmaher)
Attachment #8681440 - Flags: review?(jmaher) → review+
Earlier patch was wrong, here's a corrected version. :)
Attachment #8681440 - Attachment is obsolete: true
Attachment #8681442 - Flags: review?(jmaher)
Comment on attachment 8681442 [details] [diff] [review]
Updated patch again

Review of attachment 8681442 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with a few nits fixed.

::: testing/talos/talos/test.py
@@ +547,5 @@
>  
>  class dromaeo(PageloaderTest):
>      """abstract base class for dramaeo tests"""
>      filters = filter.dromaeo.prepare()
> +    unit = 'second'

not sure about this unit, I would call it 'score' as well.

@@ +645,5 @@
>      preferences = {'layout.frame_rate': 0,
>                     'docshell.event_starvation_delay_hint': 1,
>                     'dom.send_after_paint_to_content': False}
>      filters = filter.ignore_first.prepare(5) + filter.median.prepare()
> +    unit = '1/FPS'

is this really 1/FPS, probably 'score' is sufficient for now- I could be wrong here.
Attachment #8681442 - Flags: review?(jmaher) → review+
Earlier try run had some issues, doing another go with Joel's nits addressed + some updates:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e8b705e32535
(In reply to William Lachance (:wlach) from comment #34)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/
> 24af9e133d6d5fd7eee578ab23995c070debacf6
> Bug 1204625 - Add series data (unit, whether lower is better) to Talos
> output;r=jmaher

Eugh, I forgot to credit MikeLing for this commit. Sorry MikeLing, thanks for all your efforts on this! :)
Just realized we need to add lower is better metadata to the summary as well.
Attachment #8683866 - Flags: review?(jmaher)
Comment on attachment 8683866 [details]
Add lower is better metadata to summary as well

good find!
Attachment #8683866 - Flags: review?(jmaher) → review+
Doing a minimal try run on updated patch before landing:

https://hg.mozilla.org/try/rev/f0ab09c39ec4
https://hg.mozilla.org/mozilla-central/rev/24af9e133d6d
https://hg.mozilla.org/mozilla-central/rev/7bd635612263
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
You need to log in before you can comment on or make changes to this bug.