Closed Bug 1196259 Opened 9 years ago Closed 7 years ago

Validate data correctness in talos - treeherder - perfherder

Categories: Testing :: Talos, defect
Priority: Not set; Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: avih; Assignee: Unassigned

Details

The goal of this bug is to come up with a system which can verify Talos data correctness "end to end" as much as possible.

Since this definition is a bit vague, I'll offer some interpretation of it. Let me know what you think.

There are 3 areas IMO which could use verification (which pretty much match the subjects on bug 1194333):

1. Data crunching on the talos side, where the inputs are the raw replicates and the outputs to be verified are the talos summarization. Depends on:

- Test-specific subtest formulas, most (but not all) of which are "remove the first N replicates, then take the median of the rest" (a rough sketch follows this list).

- Formulas for crunching replicates of the same subtest from several cycles x tpcycles x tppagecycles into a subtest summary for a test on this job (for the compare-perf subtests page).

- Formulas for crunching the data into a suite summary value (takes all subtests/replicates into account, used for the compare-perf overview page); this typically involves taking the geometric mean of the subtests of each cycle and then filtering(?) these geomeans.


2. Treeherder uses the correct talos outputs (e.g. filtered values rather than plain means, etc.). In other words, verify that the inputs to the compare-perf views (overview and subtests) and the points on TH graphs match the relevant talos outputs.


3. compare-perf (overview and subtests) crunches its inputs correctly. Depends on:

- All the inputs to the view (possibly already crunched into talos summary values)

- The formulas which combine several summarizations into a single value (per revision), for either the overview or the subtests.
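
A minimal illustrative sketch (in Python, not the actual talos code) of the kind of per-job crunching described in (1), assuming the common "remove the first N replicates, then median the rest" subtest filter and a geometric-mean suite summary; the real filters are test-specific and would need to be read from the talos configuration:

import math

def subtest_summary(replicates, ignore_first=1):
    # Common talos subtest filter (assumed here): drop the first N
    # replicates, then take the median of the rest. Some tests use
    # different filters.
    kept = sorted(replicates[ignore_first:])
    n = len(kept)
    mid = n // 2
    return kept[mid] if n % 2 else (kept[mid - 1] + kept[mid]) / 2.0

def suite_summary(subtest_summaries):
    # Suite-level summary (assumed): geometric mean of the subtest summaries.
    logs = [math.log(v) for v in subtest_summaries]
    return math.exp(sum(logs) / len(logs))

# e.g. subtest_summary([12.0, 8.1, 8.3, 8.2]) == 8.2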



We've had bugs in the past in all three areas above: in (1), e.g. the tresize summarization bug; in (2), just before we moved the summarization from treeherder to talos; and in (3), before we started using the t-test in compare-perf.


So I think we have one side of the "end to end" pretty clearly defined - all the talos replicates relevant for a specific view.

The other side, however, is not trivial, since large parts of it happen on the client side (specifically, the views of the compare-perf pages).

Also, there's the issue of automation, i.e. how much of it should be automated and in what way we want to execute it.


Here's what I first had in mind in terms of inputs/outputs:

1. The verification system uses a URL as input (compare-perf overview/subtests page URL, or a data point on TH graphs).

2. The verification system fetches the relevant replicates associated with this view, and reproduces the view's output independently from the raw data.

3. The user compares the actual vs expected values manually (side by side in two windows, or just a single value for a TH graph point).


And, possibly 4 (only for the compare-perf views):

- We modify the views to include the view data in a way which the verification system can digest (e.g. by clicking a link which displays a json blob of the data which the user sees). We could then use this data as further input to the verification system, such that we verify it automatically rather than manually side by side. I don't think it would be wise to add a TH API for the view data (i.e. the outputs of the compare-perf pages), since the only code which calculates that is client-side only, and duplicating it for a TH API would IMO defeat the purpose of verifying what the user sees.
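
As a strawman for what such a digestible blob and the automated comparison could look like (all names and fields here are hypothetical, purely to illustrate the idea):

# Hypothetical shape of the exported view data, one entry per comparison
# line the user sees, plus a trivial automated check against values the
# verification system recomputed independently.
view_blob = {
    "tscrollx linux64 opt": {"base": 3.25, "new": 3.31, "delta_pct": 1.85},
}

def check_view(view_blob, recomputed, tolerance=1e-3):
    for line, shown in view_blob.items():
        for key in ("base", "new", "delta_pct"):
            if abs(shown[key] - recomputed[line][key]) > tolerance:
                print("MISMATCH on %s/%s: page=%s expected=%s"
                      % (line, key, shown[key], recomputed[line][key]))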


As for the talos inputs, I think they should be the replicates in the TALOSDATA json blobs inside the raw talos logs. Parsing this JSON (possibly from the raw log) and crunching it as needed in order to verify the talos crunching side should not be an issue IMO.


Issues which need consideration:

- For (2), how do we associate a treeherder view with the relevant raw talos data (either replicates, or, assuming (1) already verifies the talos summarization, only the talos summaries)? I _think_ that being able to associate a TH view with a list of raw talos log URLs would suffice for me, but this is also something I don't know how to do yet.

- To fully automate it for unattended verification would require a working browser, since one side of the "end to end" - the outputs of the compare-perf views - is not stored anywhere in treeherder and is only calculated in the browser and then displayed.

- Do we also want to verify graphserver data (even if we're not likely to fix it, it's still important to know about it)? Earlier today Joel and I noticed a possible data correctness issue with GS values in the raw talos logs.


Comments? thoughts? gotchas? Is it too much or redundant? Did I miss something important?
Flags: needinfo?(wlachance)
Flags: needinfo?(vdjeric)
Flags: needinfo?(jmaher)
Some thoughts:
- Focus exclusively on verifying perfherder compare first, postpone verifying TreeHerder graphs
- Focus on verifying end-to-end from Talos data to Compare-Talos. That means you'll need to verify the data crunching done on Talos and compare-perf data crunching within the webpage (items #1 and #3 in your comment)
- The input to your verification system should be 2 revisions or treeherder URLs. Your system can then fetch the Talos logs and launch the compare-perf page. For now, the person who runs your script can visually diff the script output vs the compare-perf output

From "Issues which need considerations":

- I don't understand what you mean by "treeherder view"
- Don't try to fully automate this yet, leave the parts that are hard to automate for later
- GraphServer verification is out of scope
Flags: needinfo?(vdjeric)
Thanks, that's good focusing and prioritizing.

(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #1)
> - Focus on verifying end-to-end from Talos data to Compare-Talos. That means
> you'll need to verify the data crunching done on Talos and compare-perf data
> crunching within the webpage (items #1 and #3 in your comment)

1 and 3 are not an issue since each of them is limited in scope to a single place (respectively a single talos log file, and the inputs to a single compare view page), but the toughest cookie in the "end to end" equation is 2.

(2) is not about verifying any calculations, since there are no more calculations happening anywhere other than in (1) and (3) (specifically, treeherder is now a 1:1 data storage and passthrough, and that's it), but rather verifying that the correct data sources are used as inputs for (3).

So the question (2) tries to answer/verify is: "did we use incorrect talos output? did we miss a talos output we should have been using as well?", etc.

> - The input to your verification system should be 2 revisions or treeherder
> URLs. Your system can then fetch the Talos logs and launch the compare-perf
> page. For now, the person that runs your script can visually diff script
> output vs compare-perf output

For two given revisions, compare-perf has MANY pages. It has an overview page where each line compares a test on a platform (and optimization level, e10s flag, etc. - each of these multiplies the number of lines rather than adding to it), and then each of those lines links to a new "subtests view" which lists the differences for each subtest of this test x platform combination.

So while it's not any harder technically for the verification script to output the main compare + all the subtests compares (typically several per test) which compare-perf can generate for two revisions, it's going to be a LOT of data to sift through visually, and fetch programmatically.

We're talking hundreds of megabytes of talos data (assuming we're parsing the raw talos logs), and output which spans dozens of pages to cover everything compare-perf can display for two given revisions. Depending on the user's network performance and the servers, it could take many minutes to fetch all the data which covers all the comparison views between two revisions.

My suggestion was that the input is a compare view URL, be it either the overview page, or any of the subtests pages, and the script will verify the output of this specific page.


> From "Issues which need considerations":
> 
> - I don't understand what you mean by "treeherder view"

This is a compare-perf view between two revisions - either the overview comparison, or any of the subtests comparison pages for these two revisions (and we're dropping the treeherder graphs for now).


But let's wait for Joel's and Will's inputs before we delve into more details.
(In reply to Avi Halachmi (:avih) from comment #0)

> - For (2), how do we associate a treeherder view with the relevant raw talos
> data (either replicates, or, assuming (1) already verifies the talos
> summarization, only the talos summaries)? I _think_ that being able to
> associate a TH view with a list of raw talos log URLs would suffice for me,
> but this is also something I don't know how to do yet.

So this part is a bit tricky. I think the easiest approach would probably be to reproduce the client-side logic in a separate python script. Conveniently, we store a reference to the job for each performance datum, so we can look up the job from there. This goes something like:

(1) Download the list of performance signatures for each branch
(2) Download all performance data for each revision
(3) Download all job logs for each performance datum, parse out the TALOSDATA structure
(4) Re-calculate the summary information from the replicates stored in the TALOSDATA structure. One possibly tricky bit here is relating the performance signature (what identifies the platform/test/options for a performance series) stored by perfherder back to the TALOSDATA blob, hopefully it isn't too hard. 
(5) Verify that the summary information we recalculated matches what we're displaying.

I can probably help you get started with this script; as I said on irc yesterday, this will be a bit easier/faster once the database refactoring I started in bug 1192976 is finished.
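
A very rough skeleton of such a script, following the five steps above, could look like the sketch below. To be clear, the endpoint paths, parameters and field names are placeholders/assumptions rather than the actual perfherder API, and summarize() is left as a stub for the talos-crunching re-implementation:

import json
import re
import requests

TREEHERDER = "https://treeherder.mozilla.org"

def get_json(path, **params):
    # Placeholder helper; the exact perfherder endpoints/params are assumed.
    resp = requests.get(TREEHERDER + path, params=params)
    resp.raise_for_status()
    return resp.json()

def summarize(talosdata, signature_info):
    # Stub: re-implement the talos crunching here (subtest filters + suite
    # geomean), along the lines of the sketch in comment 0.
    raise NotImplementedError

def verify_revision(project, revision):
    # (1) performance signatures for the branch
    #     (what identifies platform/test/options for a series)
    signatures = get_json("/api/project/%s/performance/signatures/" % project)

    # (2) performance data for the revision, keyed by signature
    perf_data = get_json("/api/project/%s/performance/data/" % project,
                         revision=revision)

    for signature, data_points in perf_data.items():
        for datum in data_points:
            # (3) fetch the raw job log and parse out the TALOSDATA blob
            #     ("log_url" and the regex are assumptions)
            log_text = requests.get(datum["log_url"]).text
            talosdata = json.loads(
                re.search(r"TALOSDATA:\s*(\[.*\])", log_text).group(1))

            # (4) re-calculate the summary from the raw replicates
            recalculated = summarize(talosdata, signatures.get(signature))

            # (5) compare against the value perfherder stored/displays
            if abs(recalculated - datum["value"]) > 1e-6:
                print("MISMATCH for %s: stored=%s recalculated=%s"
                      % (signature, datum["value"], recalculated))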

> - To fully automate it for unattended verification would require a working
> browser, since one side of the "end to end" - the outputs of the
> compare-perf views - are not stored anywhere in treeherder and only
> calculated at the browser and then displayed.

Yeah, you'd need to use something like selenium for this to be automated. I think I'd probably start with a semi-automatic solution for now. Once the above script is in place, I don't think it should take too long to run through things and be pretty confident in the results.
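
For reference, semi-automating the scraping of a compare-perf page with selenium might look roughly like this; the CSS selector and the wait strategy are guesses and would need to match the real compare-perf markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

compare_url = ("https://treeherder.mozilla.org/perf.html#/compare"
               "?originalProject=try&originalRevision=<base>"
               "&newProject=try&newRevision=<new>")

driver = webdriver.Firefox()
driver.get(compare_url)
driver.implicitly_wait(30)  # crude: wait for the client-side crunching to finish

# ".compare-table tr" is a placeholder selector, not the real markup.
for row in driver.find_elements(By.CSS_SELECTOR, ".compare-table tr"):
    print(row.text)

driver.quit()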
Flags: needinfo?(wlachance)
I like the way this is going in general.  We should be able to get the logs from the build directory easily; if not, we could probably find them via some treeherder APIs.

Validating raw replicates -> summarized data would validate all of the talos work and can be done outside of perfherder.  This would ensure that what is in talosdata is valid.

Next we should validate perfherder ingestion and what we see in compare view.  The problem here is: do we want to validate revisions with >1 data point?  That is useful for compare view, but adds to the complexity.

I would think that for validating perfherder data we could use the API to query the data and validate it against the talosdata we have.

This leaves only summarization in compare view.  Compare view does calculations of its own - could we add a mode that generates a json output of the results for a given view?  If we did that, this could be resolved more programmatically.

There is some code in here to query data from perfherder:
https://github.com/jmaher/alert_manager/blob/master/alerts.py

This also does some rough math comparisons (still to be developed) and could serve as a starting point for this toolchain.
> So while it's not any harder technically for the verification script to output the main compare + all 
> the subtests compares (typically several per test) which compare-perf can generate for two revisions, 
> it's going to be a LOT of data to sift through visually, and fetch programmatically.

Ok. We can focus exclusively on a single platform & a few tests to start, and expand from there. Using a compare view URL is fine too, but you can leave that for later.
Just to make the scope of the task clearer, each side of the "end to end" is split among many different end points, but the "main" overview compare page is a single end point which implicitly takes all the data relevant to these two revisions into account.

Technically this includes all the data which all the subtests views take into account, but practically it doesn't.

Since this is a humongous amount of source data for a single compare view overview page, treeherder decouples the data into "tiers" of summaries (all verbatim outputs from talos, where talos outputs two tiers), such that each compare view page (overview or subtests) uses a much smaller set of inputs.

On top of (or rather before) that, the processing of the replicates only happens in talos (verifiable by (1) independently for each relevant talos job), and while treeherder stores the replicates, they're not used as inputs to any of the compare view pages (they were used as inputs to calculations inside treeherder before bug 1184966 landed).

Theoretically, in order to verify end-to-end that the overall compare view page is correct, we'll have to crunch (after collecting) all the replicates which the view is derived from.

I _think_ that this is impractical due to the magnitude of the source data if we're not depending in some way on the intermediate summarization tiers.

OTOH, depending on existing summarization tiers does NOT verify the data end to end. So there's a tricky point here which we need to understand how to tackle.


I'll describe the data flow and summary tiers to make sure we're all on the same page, going from the overall compare view to its data sources.

----------------------------------

Let's take as an example this compare-perf overview page from bug 1190664: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fa358692aeaf&newProject=try&newRevision=54483e48d958 (and assume it has the new data flow where treeherder doesn't crunch numbers):

This page shows comparisons for 22 different talos tests (suites) (tscrollx, tcanvasmark, etc) over 6 platforms (win7-32, linux-64, etc), overall 132 comparison lines for opt builds only and for non-e10s only.

Multiply that by (almost) two if we also had PGO builds for these revisions.

Multiply again by two if we also had and displayed e10s builds comparisons.

If we happened to have this data (and all m-c builds should have this amount of data, maybe slightly reduced), we would have ended up with 528 comparison lines for the overview page. While I personally don't recall such big overview pages, they could (do?) exist.

Anyway, let's stay with our existing 132 lines.

After retriggering to have more stable data, we have 20 retriggers for all jobs - for each of base/new revisions which are being compared and on all platforms.

So each of the existing 132 lines uses 40 unique data points as input: 20 suite summaries for each of the base and new revisions of this configuration - overall 5280 unique suite summaries (each a single number) as inputs for the overview page.

The overview page grabs those 5280 numbers from treeherder and crunches them client side to generate the overview comparison page view.

Let's say that a single talos job runs 3 different tests on average (of a single build configuration), and so produces 3 suite summaries (3 of the 5280 values which this overview uses as inputs, contributing 1 of the 40 values to each of 3 of the 132 lines).

This means we ran 5280/3 = ~1760 different talos jobs in order to generate the inputs for this overview comparison page.

I only examined a single talos job log, and it ended up as 1.2M (the actual usable data inside that log was much smaller - 40K), but treeherder ingested each of those 1760 log files fully, and stored what it needed from them.

Let's assume each talos job log is 500K on average. So that's 1760 jobs, which together output almost 900M of raw data to digest as input, in order to display a single comparison overview page.
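
As a quick sanity check of the arithmetic above (a sketch; the per-job averages are the assumptions stated in this comment):

tests = 22            # talos suites on the overview page
platforms = 6
retriggers = 20       # runs per revision per configuration
revisions = 2         # base + new
tests_per_job = 3     # suites per talos job (assumed average)
avg_log_kb = 500      # assumed average talos log size

lines = tests * platforms                         # 132 comparison lines
suite_summaries = lines * retriggers * revisions  # 5280 input values
jobs = suite_summaries // tests_per_job           # 1760 talos jobs
total_mb = jobs * avg_log_kb / 1024.0             # ~860M of raw logs

print(lines, suite_summaries, jobs, round(total_mb))  # 132 5280 1760 859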

While I don't want to even begin to think of how many CPU hours and $$$ were used for that, from our end-to-end verification perspective for this single compare overview page, this means the script will have to do the following:

1. Figure out those 1760 talos job log URLs.

2. Fetch 1760 talos logs, each 500K on average.

3. Verify that each of those 1760 jobs crunched the numbers correctly*; that's essentially task (1) - verifying the talos data crunching, just for the 5280 test runs we need here, which are spread across 1760 log files.

4. Take those suite summaries from all the logs and compute the overview comparison page independently (that's task (3)).


* Assuming each test has 5 subtests on average and we run each test 10 cycles on average, that's 50 replicates per test, or 150 replicates over the 3 tests per job. There's also a subtest summary tier of 15 values derived from those 150 replicates, so overall each talos job produces 3 + 15 unique summary values: the 3 suite summaries contribute to the overall compare view, and the 15 subtest summaries are used throughout the 132 pages of subtest comparisons.


----------------------

So, for the concrete example above:

- fetching a single talos log and verifying the talos crunching in it is easy.

- verifying that 5280 known input values produce the overall page view is also easy.

- (for each of 132 subtest view pages) verifying the view from 200 subtests summary values (5 subtests * 20 retriggers * two revisions) is also easy.


The main issue is figuring out where these 1760 talos logs are, and then actually fetching and processing all the 900M of them.

This would produce 5280 suite summary values + 132 * 200 subtest summary values = overall ~32K values (from 900M of raw input), which should then be used to verify a single overview page and 132 subtests pages, all of which are the output of comparing these two revisions.

IMO this is not practical, and we should instead either only sample the data somehow, or rely on earlier processing - possibly the processing which treeherder did when collecting all those logs.

But then, can we still say that we're verifying end to end?

It seems like there's a huge cruft-to-useful-data ratio between the number of input and output data points in each talos log and the log size; maybe we should attack the problem by reducing this ratio somehow?

After all, each talos job has 168 useful numbers in it: 150 replicate values + 3 suite summary values + 15 subtest summary values (on average, of course). That's it. 168 numbers. For 1760 jobs that's ~300K values (of which only 32K are used as inputs for the views, and the other 270K values are used to generate those 32K values).

That's easily manageable, scale wise, but mining those values independently to verify end to end just doesn't seem practical to me.

How do we attack this?
You may ignore the braindump which is comment 5. I'll post a shorter version tomorrow.
(In reply to Avi Halachmi (:avih) from comment #7)
> You may ignore the braindump which is comment 5.

Comment 6 obviously. Less tired now, so I hope I can make it more focused:
--------------

I don't think it's practical to test end-to-end and I think we'll need to think of a different approach.

On a concrete example which I analyzed (the first comment at bug 1190664), one end is 900M of talos logs split between ~1700 log files, and the other end is the output of 133 different and unique web pages which generate their outputs on the client side.

If we wanted to validate only the "main" overview page on the output end, then this end would be the output generated on a single web page instead of 133, but the input end remains the same (900M over 1700 log files).


I see three main issues:
A. A very low signal to noise ratio on the talos logs end.
B. The number of unique pages we'd need to validate on the other end.
C. The fact that in order to connect the dots between both ends we'd need to trust the core of the system we're trying to verify (treeherder).


A. Signal to noise on the talos end:

- The signal is 300K usable values (numbers) in the talos logs:
  -   6K suite summaries, used as inputs to the overview comparison page.
  -  25K subtest summaries, used as inputs to the different subtest pages.
  - 270K replicates, used to generate the above 31K values and which
         we want to verify, but otherwise they're not used elsewhere.

Noise #1 is the log file size - 500K on average for a single log file, from which we need to collect ~160 numbers.

Noise #2 is the fact that we need to independently grab ~1700 such talos logs.


B. is clear: in order to validate 133 web pages with client-side generated values, a user will need to open all those pages, or we need to automate it.

C. is also clear: if we want to verify the outputs, we need to know which inputs we should use, but to do that, we have to trust treeherder to tell us that - and treeherder is part of the system we're trying to verify.


So the questions we need to answer IMO are:

A. How can we increase the signal to noise ratio on the talos end while being reasonably confident that this reduction of noise is valid? I.e. we need 300K numbers we can trust, instead of 900M of logs split between 1700 locations.

B. I think we answered this one earlier - for now, we'll verify each output on demand. I.e. a user chooses a compare view web page, either the main one or a subtest view, and we'll validate that.

C. To what degree should the verification system trust treeherder?
(In reply to Avi Halachmi (:avih) from comment #8)
> So the questions we need to answer IMO are:
> A. ...

D. Maybe we want to give up on the end-to-end verification and instead verify only pieces of each tier on demand:

1. Given a specific talos log, verify that talos crunched the values correctly within this log (there are no calculations of values between different talos logs except in compare-view pages - (2) below)

2. Given a specific compare view page, verify that the page crunched its inputs correctly (a rough sketch follows below).

3. Trust treeherder to correctly route outputs from talos to inputs of each compare-view page.


This would not cover end to end and would not be able to verify the treeherder data routing, but it would let us verify concrete pieces which we might suspect.
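
For item 2, a minimal sketch of the kind of independent recomputation involved, assuming the page's comparison boils down to a percent delta plus a t-test over the per-job suite summaries (comment 0 mentions compare-perf using a t-test; the exact confidence formula would need to be confirmed against the compare view code):

import statistics

def compare_line(base_values, new_values):
    # Recompute one comparison line from the per-job suite summaries
    # of the base and new revisions.
    base_avg = statistics.mean(base_values)
    new_avg = statistics.mean(new_values)
    delta_pct = 100.0 * (new_avg - base_avg) / base_avg

    # Welch's t statistic as a rough confidence measure (assumption:
    # compare view uses something t-test-like).
    var_term = (statistics.variance(base_values) / len(base_values) +
                statistics.variance(new_values) / len(new_values))
    t = (new_avg - base_avg) / var_term ** 0.5 if var_term else float("inf")
    return delta_pct, t

# e.g. 20 retriggered suite summaries per revision would feed one line:
print(compare_line([101.0, 99.5, 100.2, 100.8], [103.1, 102.6, 103.4, 102.9]))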
As discussed in today's perf-testing meeting, let's pursue approach D described in comment 9.
Flags: needinfo?(jmaher)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX