Should document summarization algorithms in talos test descriptions

Status: RESOLVED FIXED
Component: Testing :: Talos
(Reporter: wlach, Assigned: jmaher)

We have documentation providing a description of each talos test here:

https://wiki.mozilla.org/Buildbot/Talos/Tests

To help people interpret these results, we should document whatever formulas we use for the suite-level summaries here. Now that we're generating these results within talos itself (bug 1150152), this should be quite helpful for people running talos locally.
See Also: → bug 1184966

Comment 1

3 years ago
Terminology is probably also worth documenting, specifically the relationships between these terms:
- replicates
- subtest
- test (or suite?)
- job
Assignee: nobody → jmaher
I have completed this task by adding definitions of the terminology and the formulas we apply to the data:
https://wiki.mozilla.org/Buildbot/Talos/Data

Then I have annotated all the tests:
https://wiki.mozilla.org/Buildbot/Talos/Tests

to include what data and summarization are used. There is also example data which you can expand/collapse if you are interested in what the raw data should look like.

:glandium, you have always done a great job of finding real issues with our tools and we would really like to get your point of view on whether this pass of documentation work is helpful.
Flags: needinfo?(mh+mozilla)
I did a few minor edits:
https://wiki.mozilla.org/index.php?title=Buildbot%2FTalos%2FData&diff=1091736&oldid=1091285

Now, some comments:
> - in Perfherder a suite is referenced as a 'Summary' (e.g "tp5o summary opt") 

Can't Perfherder just use the same terminology?

> Terminology
>  Job
>  Suite
>  Subtest
>  Replicates

Leaving aside the terms themselves, I think it would be better if this section was reversed in order, because each of them keeps referring to the next one, and when you read it, it's better to start from the "base" (replicates) and grow the knowledge from there than have the reader wonder in each paragraph "what the hell is $thing" and then realize "oh, $thing is explained next".  That's exactly how that happened for me, except it only really happened for $thing == "replicate", because I already had a grasp of the other terms.

Which brings to the issue of terms themselves. Since I had to wonder what the hell a replicate is in this context, is it really the most appropriate term to be using?

> Dromaeo
>  returns: a single number (geometric_mean of the metric summarization) 

metric summarization doesn't ring a bell. And Googling doesn't help. Reading the code, it looks like you're doing a geometric mean of the means of $whatever.

Side-stepping from the documentation, this seems to me to be the wrong thing to do, but it's not something to discuss here; there's already a bug about the geometric mean not working ;) (and reading further, I realize the "global" score is a geometric mean of those geometric means. Wow.)
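For reference, here is a minimal sketch of that "geometric mean of means" calculation as I read it from the code. The chunk size of 5 and all function names here are my assumptions, not the actual talos API:

```javascript
// Sketch of the "geometric mean of means" summarization described above.
// Chunk size and helper names are assumptions from reading the code.
function average(values) {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

function geomean(values) {
  // exp(mean(log x)) is numerically safer than multiplying everything
  return Math.exp(average(values.map(Math.log)));
}

function dromaeoStyleSummary(replicates, chunkSize = 5) {
  const means = [];
  for (let i = 0; i < replicates.length; i += chunkSize) {
    means.push(average(replicates.slice(i, i + chunkSize)));
  }
  return geomean(means);
}
```

So with ten replicates, you get the geometric mean of two chunk averages, and the "global" score is then yet another geometric mean over those.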

> v8_subtest
>  returns: a single value representing the benchmark weighted score for the subtest

That's very vague, although reading the code does help, but should I have to read the code? (btw, if talos code was on dxr, and dxr had permalinks for specific versions of the files, the multi-line links on dxr would really be helpful to highlight the right function. For v8_subtest, it's especially annoying because it's last in the file, so when you click on the link, you don't actually get to the function, and the visual cue for the anchor is almost invisible)

> (filter name)
>   (description)
>   (...)
>   returns: (...)

There's an inconsistency in how things are described for each filter. Some have the description of what the returned value is in the "returns" parts, some don't. Some have the specifics more detailed in the description, and some have them more detailed in the "returns" parts.

> http://hg.mozilla.org/build/talos/file/3625fcaa75ea/talos/output.py#l115

Reading this I have no idea why val_list has pairs of something, one of which is ignored, and I was driven to read the code by the wiki page saying "essentially it is a sum of the subtests", that I read as meaning there might be some other minor details left out, which, in fact, there aren't.

> NOTE: this is identical to JS_Metric

Why have two different functions, then? This is quoting the code, but the same can be said from the wiki page content itself. The wiki page lists filters that are more or less generic, why have two filters that do the same generic thing, except they do it for different subtests?

> When the raw data comes in, we look for the summary tag in the json.
> 
> {"suite": 23.21740000000001, ... }
>
> In this case we use 23.217 for the value inside of perfherder.

If you mean that perfherder truncates to the third decimal place, please just say it. And is it rounding or truncating?
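For what it's worth, the wiki's example value can't distinguish the two operations; this is just an illustration of the question, not what Perfherder actually does:

```javascript
// Illustration of truncating vs rounding to three decimal places.
// Not Perfherder's actual code, just a demo of the distinction.
function truncate3(x) {
  return Math.trunc(x * 1000) / 1000;
}
function round3(x) {
  return Math.round(x * 1000) / 1000;
}

// The wiki's example value gives the same answer either way:
truncate3(23.21740000000001); // 23.217
round3(23.21740000000001);    // 23.217

// A value whose fourth decimal is >= 5 tells them apart:
truncate3(23.2176); // 23.217
round3(23.2176);    // 23.218
```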

> In all cases there should be a 'subtests' field as well that lists out
> each page loaded along with a set of values:
>
> "subtests": {"tresize": {"std": 0.7716690474213389, "min": 22.18191666666666, "max": 24.96453333333337, "median": 23.21740000000001, "filtered": 23.21740000000001, "mean": 23.254913333333334}
>
> These values are used in the sub test specific view (not the suite summary).
> When viewing a graph, you can switch between different values for each
> data point to see what the mean, median, etc. are. This is where we get the
> fields. In addition, the default value is the 'filtered' value, this takes
> into account filters (ignore first 'x' data points, median|mean, etc.) on
> the raw data so we have summarized data being calculated at a single point.

I think this could use some more verbosity as to what is meant. I had to read this three times, and I'm not entirely sure I got it.
Flags: needinfo?(mh+mozilla)
Some comments and one question for :jmaher...

(In reply to Mike Hommey [:glandium] from comment #3)
> I did a few minor edits:
> https://wiki.mozilla.org/index.php?title=Buildbot%2FTalos%2FData&diff=1091736&oldid=1091285
> 
> Now, some comments:
> > - in Perfherder a suite is referenced as a 'Summary' (e.g "tp5o summary opt") 
> 
> Can't Perfherder just use the same terminology?

They really are different things: a suite is a collection of subtests; a summary is a single value representative of the subtest results.

I tried to update the docs to be a bit more clear on this:

https://wiki.mozilla.org/index.php?title=Buildbot%2FTalos%2FData&diff=1092331&oldid=1091736

> > Terminology
> >  Job
> >  Suite
> >  Subtest
> >  Replicates
> 
> Leaving aside the terms themselves, I think it would be better if this
> section was reversed in order, because each of them keeps referring to the
> next one, and when you read it, it's better to start from the "base"
> (replicates) and grow the knowledge from there than have the reader wonder
> in each paragraph "what the hell is $thing" and then realize "oh, $thing is
> explained next".  That's exactly how that happened for me, except it only
> really happened for $thing == "replicate", because I already had a grasp of
> the other terms.

Hmm, I'm not sure about this. I think most people would want to understand how jobs relate to what is collected, so I think it at least makes sense to talk about that first. And "suites" are really the most important high-level concept for people to understand here.

> Which brings to the issue of terms themselves. Since I had to wonder what
> the hell a replicate is in this context, is it really the most appropriate
> term to be using?

I'm not crazy about the term "replicate" either. Maybe we should just call them subtest results?

> > Dromaeo
> >  returns: a single number (geometric_mean of the metric summarization) 
> 
> metric summarization doesn't ring a bell. And Googling doesn't help. Reading
> the code, it looks like you're doing a geometric mean of the means of
> $whatever.

I agree, this explanation doesn't make much sense to me and reading the code makes my brain hurt. Do we have an explanation/justification for why we're doing the calculations this way somewhere? :jmaher?

> > v8_subtest
> >  returns: a single value representing the benchmark weighted score for the subtest
> 
> That's very vague, although reading the code does help, but should I have to
> read the code? (btw, if talos code was on dxr, and dxr had permalinks for
> specific versions of the files, the multi-line links on dxr would really be
> helpful to highlight the right function. For v8_subtest, it's especially
> annoying because it's last in the file, so when you click on the link, you
> don't actually get to the function, and the visual cue for the anchor is
> almost invisible)

I can't think of a better way of explaining this offhand. I agree it's hard to understand without looking at the source, but I'm not sure that's such a bad thing.

> > (filter name)
> >   (description)
> >   (...)
> >   returns: (...)
> 
> There's an inconsistency in how things are described for each filter. Some
> have the description of what the returned value is in the "returns" parts,
> some don't. Some have the specifics more detailed in the description, and
> some have them more detailed in the "returns" parts.

Hmm, it actually doesn't look that inconsistent to me? Specific examples of how an individual filter is confusing and how that might be addressed would be helpful.

> > http://hg.mozilla.org/build/talos/file/3625fcaa75ea/talos/output.py#l115
> 
> Reading this I have no idea why val_list has pairs of something, one of
> which is ignored, and I was driven to read the code by the wiki page saying
> "essentially it is a sum of the subtests", that I read as meaning there
> might be some other minor details left out, which, in fact, there aren't.
> 
> > NOTE: this is identical to JS_Metric
> 
> Why have two different functions, then? This is quoting the code, but the
> same can be said from the wiki page content itself. The wiki page lists
> filters that are more or less generic, why have two filters that do the same
> generic thing, except they do it for different subtests?

I think this may have been fixed; I don't see it on the copy of the wiki I'm looking at.

> > When the raw data comes in, we look for the summary tag in the json.
> > 
> > {"suite": 23.21740000000001, ... }
> >
> > In this case we use 23.217 for the value inside of perfherder.
> 
> If you mean that perfherder truncates to the third decimal place, please
> just say it. And is it rounding or truncating?

Fixed.

> > In all cases there should be a 'subtests' field as well that lists out
> > each page loaded along with a set of values:
> >
> > "subtests": {"tresize": {"std": 0.7716690474213389, "min": 22.18191666666666, "max": 24.96453333333337, "median": 23.21740000000001, "filtered": 23.21740000000001, "mean": 23.254913333333334}
> >
> > These values are used in the sub test specific view (not the suite summary).
> > When viewing a graph, you can switch between different values for each
> > data point to see what the mean, median, etc. are. This is where we get the
> > fields. In addition, the default value is the 'filtered' value, this takes
> > into account filters (ignore first 'x' data points, median|mean, etc.) on
> > the raw data so we have summarized data being calculated at a single point.
> 
> I think this could use some more verbosity as to what is meant. I had to
> read this three times, and I'm not entirely sure I got it.

Perhaps a specific example might be useful?
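For instance, something along these lines (the replicate values below are made up; tresize's actual filter drops the first 5 replicates and takes the median of the rest):

```javascript
// Made-up replicate values; a real tresize run collects more data points.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const replicates = [30.1, 28.4, 27.2, 26.0, 25.1, 23.0, 23.5, 22.9, 23.4, 23.2];

// 'filtered' = median after ignoring the first 5 warm-up data points
const filtered = median(replicates.slice(5)); // 23.2
```

The min/max/mean/median/std fields are then computed over all the replicates, while 'filtered' is the one value that has the suite's filter chain applied.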
Flags: needinfo?(jmaher)
> > > http://hg.mozilla.org/build/talos/file/3625fcaa75ea/talos/output.py#l115
> > 
> > Reading this I have no idea why val_list has pairs of something, one of
> > which is ignored, and I was driven to read the code by the wiki page saying
> > "essentially it is a sum of the subtests", that I read as meaning there
> > might be some other minor details left out, which, in fact, there aren't.
> > 
> > NOTE: this is identical to JS_Metric
> > 
> > Why have two different functions, then? This is quoting the code, but the
> > same can be said from the wiki page content itself. The wiki page lists
> > filters that are more or less generic, why have two filters that do the same
> > generic thing, except they do it for different subtests?
>
> I think this may have been fixed; I don't see it on the copy of the wiki I'm looking at.

Both Canvasmark_metric and js_metric are in the current copy of the wiki. They both do the same thing.

> > > (filter name)
> > >   (description)
> > >   (...)
> > >   returns: (...)
> > 
> > There's an inconsistency in how things are described for each filter. Some
> > have the description of what the returned value is in the "returns" parts,
> > some don't. Some have the specifics more detailed in the description, and
> > some have them more detailed in the "returns" parts.
>
> Hmm, it actually doesn't look that inconsistent to me? Specific examples of how
> an individual filter is confusing and how that might be addressed would be helpful.

See for example mean vs dromaeo vs v8_subtest.

- mean has an independent description defining what the returned value is, and "returns: a single value".
- dromaeo has an independent description defining what the returned value is at a high level, and "returns: a single value" followed by a description of what the calculation actually is.
- v8_subtest has no independent description, and "returns: a single value" followed by a (vague) description of what the calculation actually is.

That's what I mean by inconsistent: they're not all described the same way.

Comment 6

3 years ago
To make it more readable, I'm posting here for reference a summary of how each subtest value is calculated (from replicates) and how each suite value is calculated from subtest values, for all current suites.

As you can tell, for most of the tests:
- The subtest summary value is the median of all replicates after throwing away the first 5 (or 1) values.
- The suite summary value is the geometric mean of all the subtest summary values.

The exceptions are kraken, dromaeo and v8_7.

var suites = {
  // c job
  tresize: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  },

  tcanvasmark: {
    subtest: medianAfterRemovingFirst(1),
    suite  : sumSubtests
  },

  // d job
  v8_7: {
    subtest: v8SubtestSummary,
    suite  : function (subtests) { return 100 * geomeanSubtests(subtests); }
  },

  kraken: {
    subtest: average,
    suite  : sumSubtests
  },

  dromaeo_css: {
    subtest: dromaeoSubtestSummary,
    suite  : geomeanSubtests
  },

  // g1 job
  tp5o_scroll: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  glterrain: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  // g2 job
  damp: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  tps: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  // o job
  a11yr: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  ts_paint: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  tpaint: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  },

  sessionrestore: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  sessionrestore_no_auto_restore: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  // s job
  tsvgx: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  },

  tsvgr_opacity: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  },

  tscrollx: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  },

  tart: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  cart: {
    subtest: medianAfterRemovingFirst(1),
    suite  : geomeanSubtests
  },

  // tp job
  tp5o: {
    subtest: medianAfterRemovingFirst(5),
    suite  : geomeanSubtests
  }
};

// the basic subtest summary filter template
function medianAfterRemovingFirst(firstN) {
  return function(replicates) {
    return median(replicates.slice(firstN));
  }
}


// Takes an object of {k1: v1, k2: v2, ...} and returns an array [v1, v2, ...]
// NOTE: the order is arbitrary
function arrayify(flatObject) {
  var arr = [];
  for (var i in flatObject) {
    arr.push(flatObject[i]);
  }

  return arr;
}

// basic suite summary from named (but unordered) subtest summaries
function geomeanSubtests(subtests_summaries_object) {
  return geomean(arrayify(subtests_summaries_object));
}

function sumSubtests(subtests_summaries_object) {
  return sum(arrayify(subtests_summaries_object));
}

function dromaeoSubtestSummary(replicates) {
  var CHUNK = 5;
  var means = [];
  for (var i = 0; i < replicates.length; i += CHUNK) {
    means.push(average(replicates.slice(i, i + CHUNK)));
  }

  return geomean(means);
}

function v8SubtestSummary(replicates, name) {
  var reference = {
    'Encrypt': 266181.,
    'Decrypt': 266181.,
    'DeltaBlue': 66118.,
    'Earley': 666463.,
    'Boyer': 666463.,
    'NavierStokes': 1484000.,
    'RayTrace': 739989.,
    'RegExp': 910985.,
    'Richards': 35302.,
    'Splay': 81491.
  };

  return reference[name] / geomean(replicates);
}
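As a usage sketch of the table above (the subtest summary values here are made up, and sum/average/geomean are the helpers the listing leaves implicit):

```javascript
// Helper definitions the listing above assumes; data is made up.
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}
function average(values) {
  return sum(values) / values.length;
}
function geomean(values) {
  return Math.exp(average(values.map(Math.log)));
}

// Most suites: geometric mean of the subtest summary values.
const subtestSummaries = { subtest_a: 4, subtest_b: 9 };
const geomeanSuite = geomean(Object.values(subtestSummaries)); // ≈ 6

// kraken / tcanvasmark: plain sum of the subtest summary values.
const sumSuite = sum(Object.values(subtestSummaries)); // 13

// v8_7: 100 * geometric mean, per the table above.
const v8Suite = 100 * geomean(Object.values(subtestSummaries)); // ≈ 600
```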
I have updated the wiki pages with a few edits:
* s/replicates/data points/
* s/old links to talos repo/m-c repo/
* a few nits from the comments above

These are the pages edited:
https://wiki.mozilla.org/Buildbot/Talos/Data
https://wiki.mozilla.org/Buildbot/Talos/Tests

There have been a series of other random cleanups on those pages in the last few weeks, although nothing drastic.  I would be happy to work on other items, given direction of what is unclear.
Flags: needinfo?(jmaher)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED