Closed Bug 696196 Opened 9 years ago Closed 6 years ago
Compare-talos is completely broken by pgo/non-pgo split
I know this isn't strictly a Talos bug, but we don't have a component for Compare-Talos. Check out http://goo.gl/LwuOA. Hit "show details" at the top, and look at Dromaeo CSS. The reported score is 3105.28 (min: 2736.49, max: 3472.94). That's a pretty big difference! Looking at TBPL (https://tbpl.mozilla.org/?rev=67673422f7d2) reveals the source of the difference: Compare-talos doesn't understand that the Linux-PGO and Linux-NonPGO builds are different, and lumps both builds' test scores together. Ouch. This is dangerous breakage, and it makes compare-talos basically useless. If you compare to a revision which has PGO, everything will look like a regression. I'm tempted to say we should take down the tool until we fix this, so people don't waste their time chasing regressions which appear and disappear depending on which original revs they pick. Also, someone needs to own this tool.
https://bitbucket.org/mconnor/compare-talos is the right place to file compare-talos bugs.
Thanks, Shawn. https://bitbucket.org/mconnor/compare-talos/issue/13/compare-talos-doesnt-understand-pgo-non By the way, this can also cause you to miss regressions, since compare-talos will compute a very large variance for the test.
The only way it's a compare-talos bug is if a graphserver API exists which will tell compare-talos that the results from a ts_foo run on a PGO build are on a different platform than ts_foo run on a non-PGO build on the same actual platform. The API compare-talos currently uses says, in http://graphs-old.mozilla.org/api/test/runs/revisions?revision=798b00a6fe29 for example, that the ts_paint runs on "Fedora 12 - Constantine" were (rounding) 566, 570, 502, 496, 495, 501, 497, 504, 496, 501. I don't know why there are twice as many as it seems like there should be, but the odd-numbered ones are the results from the on-push non-PGO build, then the PGO build at 00:00, then the PGO nightly, then the PGO build at 06:00, then the PGO build at 12:00. It's not just the case that you can't compare a rev with PGO to a rev without PGO, you also can't compare PGO to PGO, only the average of (PGO + non-PGO) to another average of (PGO + non-PGO), and you can only compare csets which have had the same number of PGO runs on them.
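To make the effect concrete, here is a minimal sketch using only the ts_paint values quoted in the comment above. It assumes, as that comment describes, that the odd-numbered entries are one non-PGO run followed by four PGO runs. The variable names are illustrative and are not part of compare-talos or the graph server API.

```python
from statistics import mean, stdev

# ts_paint values quoted in the comment above (rounded), in API order.
runs = [566, 570, 502, 496, 495, 501, 497, 504, 496, 501]

# Odd-numbered entries (1st, 3rd, ...): per the comment, the on-push
# non-PGO build comes first, followed by four PGO builds.
odd = runs[0::2]
non_pgo, pgo = odd[:1], odd[1:]

# Lumping both build types together, as the single "platform" field forces.
pooled_mean = mean(odd)
pooled_sd = stdev(odd)

# Looking at the PGO builds on their own.
pgo_mean = mean(pgo)
pgo_sd = stdev(pgo)

print(f"pooled:   mean={pooled_mean:.1f} sd={pooled_sd:.1f}")
print(f"PGO only: mean={pgo_mean:.1f} sd={pgo_sd:.1f}")
```

The pooled standard deviation comes out roughly ten times larger than the PGO-only one, which is how a single mixed-in non-PGO run can both fake a regression and hide a real one.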
Component: Talos → Graph Server
Product: Testing → Webtools
QA Contact: talos → graph.server
Version: unspecified → other
(In reply to Phil Ringnalda (:philor) from comment #3) > The only way it's a compare-talos bug is if a graphserver API exists which > will tell compare-talos that the results from a ts_foo run on a PGO build > are on a different platform than ts_foo run on a non-PGO build on the same > actual platform. Can you describe what a graphserver API that would be useful to compare-talos would look like? I don't fully understand the rest of your comment :) Are PGO and non-PGO results getting mixed together? Based on the schema I don't see why we shouldn't be able to do this, but I could use some pointers on what it should look like (I am not that familiar with compare-talos or the API that it uses, and how exactly the PGO/non-PGO changes broke that)
I'm no compare-talos expert, having probably not looked at it since last December, and I think it's getting reinvented anyway. But yes: compare-talos uses that API because, at the time the API got forward-ported from graphs-old to graphs, nothing else was actually providing what it needed. That API has only one field to say what a build is, the platform. It treats WINNT 6.1 as a single platform which, for reasons it doesn't understand, sometimes gets one run of a particular talos suite and other times gets two runs of that suite. The reason is that WINNT 6.1 Non-PGO and WINNT 6.1 PGO are separate builds. Since the API only has the platform field to distinguish builds, it should be calling them WINNT 6.1 Non-PGO and WINNT 6.1 PGO.
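A rough sketch of the idea, assuming a hypothetical run record that carries a build-type flag next to the platform (the real API exposes only the platform field, so `pgo`, `platform_label`, and `group_runs` are invented here for illustration, and the values are made up):

```python
from collections import defaultdict

# Hypothetical run records with illustrative values.
runs = [
    {"platform": "WINNT 6.1", "pgo": False, "test": "ts_paint", "value": 566},
    {"platform": "WINNT 6.1", "pgo": True, "test": "ts_paint", "value": 502},
    {"platform": "WINNT 6.1", "pgo": True, "test": "ts_paint", "value": 495},
]


def platform_label(run):
    """Fold the build type into the platform name so the two never mix."""
    suffix = "PGO" if run["pgo"] else "Non-PGO"
    return f'{run["platform"]} {suffix}'


def group_runs(runs):
    """Group result values by (platform label, test name)."""
    groups = defaultdict(list)
    for run in runs:
        groups[(platform_label(run), run["test"])].append(run["value"])
    return dict(groups)


grouped = group_runs(runs)
for key, values in sorted(grouped.items()):
    print(key, values)
```

With the build type folded into the label, a consumer that only knows how to group by platform would still keep PGO and non-PGO results apart.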
I've had the main breakdown "fixed" (but not verified) for many months using datazilla data but I never got around to fixing the details/breakdown page so didn't push it live. I'll take a stab at it now since :avig wants this fixed.
Assignee: nobody → MattN+bmo
Status: NEW → ASSIGNED
I have a much improved version of compare-talos at http://compare-talos.mattn.ca/dev/ which now uses datazilla and keeps PGO and non-PGO separate. The numbers are different from graph server's, since graph server did some filtering (e.g. dropping the highest value) which I don't think happens to the raw data from DZ. If somebody could confirm the data on my dev version matches datazilla, then I will make it the production version and upstream the fix. avih or jmaher, could either of you help with this?
After I enter two revisions, I get a breakdown as usual, with the test overview on the right side and the original form on the left side. When I click the details button, the test overview takes up the whole screen (instead of the details), the form is overlaid on top of it, and no details links are clickable.
I like it, though manually choosing the branch feels a bit tiresome, especially if results are only available on one branch. I compared the numbers between the old and new compare-talos (CT), starting with this "old" link (taken from bug 994712 comment 23): http://compare-talos.mattn.ca/?oldRevs=722a4c57999b&newRev=04c13d9470f3&server=graphs.mozilla.org&submit=true I then converted it to the new CT link (I had to manually choose the "Try" branch): http://compare-talos.mattn.ca/dev/?oldBranch=Try&oldRevs=722a4c57999b&newBranch=Try&newRev=04c13d9470f3&submit=true While the change percentages (regressions/improvements) look similar (though not identical) between new and old CT, some values look very different. E.g. (test: old CT value / new CT value): CART, OS X 10.6: 70ms / 110ms; CART, OS X 10.8: 47ms / 76ms; tscrollx, OS X 10.6: 6ms / 14ms; tscrollx, OS X 10.8: 4ms / 7ms. Because the difference can be quite big, I can't say whether the new CT is "correct" without following the actual numbers CT uses to calculate them. E.g. for a 2x difference between old and new CT, if graphserver ignores one run of 25 and datazilla ignores none, then that single ignored run would have to be 50x worse than the rest, but I don't recall seeing a difference of this magnitude between the first run and the rest while browsing datazilla directly; usually the first is "only" around 1.5x slower. Would it be possible to add a debug div that outputs all the numbers used for the calculation (preferably for both new and old CT)? So my hunch is that something might be off here. The following might be a bit out of scope for this specific bug (which deals with diffing non-PGO/PGO CT results), but for now it's the only place to discuss graphserver/datazilla differences in CT. If the new CT results are indeed "correct", that seems very meaningful, though I'm not sure which kind of stats should be considered better for us.
I do know, however, that in some cases we should care about the first result while in others we could care less. E.g. tab animation happens a lot, so the first one is not too important, but customize animation doesn't happen much, so the first time IS important, while it is probably ignored on graphserver. OTOH, the difference between the first/worst run and the rest of the runs might be so big that, in datazilla averages, the worst run's magnitude plus noise could mask smaller changes in the rest of the runs. No definite conclusion here on which kind of stats is better, but a good first step would be to try to validate that the new CT stats at least use the correct numbers.
(In reply to Avi Halachmi (:avih) from comment #10) > E.g. for 2x difference between old/new CT, if graphserver ignores one run of > 25 and datazilla ignores none, then that single ignored run would have to be > 50x worse than the rest ... Sorry, the worst would have to be ~25x worse than the rest. But still not close to anything I've seen on datazilla.
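The corrected estimate can be checked with a little arithmetic. The sketch below assumes a plain mean over the runs and that all runs except the ignored one share the same value; the function name is made up for illustration.

```python
def outlier_factor(n_runs: int, ratio: float) -> float:
    """Return how many times worse than the others one run must be so that
    mean(all runs) == ratio * mean(runs without that one run).

    Assumes the other n_runs - 1 runs all share the same value.
    """
    # With common value v and outlier k * v:
    #   (n - 1 + k) * v / n = ratio * v   =>   k = ratio * n - (n - 1)
    return ratio * n_runs - (n_runs - 1)


k = outlier_factor(25, 2.0)
print(k)  # 26.0
```

Under these assumptions the ignored run would need to be about 26x worse than the rest, in line with the ~25x figure.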
I think I misinterpreted the results. Apparently the main difference between the GMO CT and DZ CT is that graphserver (on the server side) excludes the "worst" page when it calculates the overall average of the pages (and the derived overall % changes which CT displays on the main page), while DZ takes all pages into account. This change alone accounts for 99% of the differences between the GS numbers and the DZ numbers in CT. Even after we take that into account, there's still a very minor diff between the GS and DZ numbers. It should probably be attributed to the fact that DZ calculates the value of each page as the median of all 25 runs, while graphserver values usually exclude the first 1-5 results (of 25) per page and then take the median. This filtering for graphserver is done in talos, before the results are sent to graphserver. So, overall, I think the new compare-talos is good. It shows different values because it takes more results into account. FWIW, categorically, I think the datazilla numbers are better.
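The two aggregation schemes described above can be sketched roughly as follows. This is an approximation built only from the description in this comment, not the actual graphserver, talos, or datazilla code. The function names, the `skip_first` parameter, and the page data are all invented for illustration, and "worst" is assumed to mean the highest value.

```python
from statistics import mean, median


def graphserver_like(pages, skip_first=1):
    """Per page: drop the first runs, take the median.
    Overall: drop the single worst (highest) page, then average."""
    per_page = [median(runs[skip_first:]) for runs in pages.values()]
    per_page.remove(max(per_page))  # exclude the "worst" page
    return mean(per_page)


def datazilla_like(pages):
    """Per page: median of all runs. Overall: average of every page."""
    return mean([median(runs) for runs in pages.values()])


# Made-up data: three pages, five runs each, with a slow first run.
pages = {
    "page_a": [30, 10, 10, 10, 10],
    "page_b": [36, 12, 12, 12, 12],
    "page_c": [200, 100, 100, 100, 100],
}

print(graphserver_like(pages))
print(datazilla_like(pages))
```

In this toy data the per-page medians are identical under both schemes, so the whole gap between the two overall numbers comes from excluding the worst page, which mirrors the observation that this exclusion accounts for most of the difference.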
Marking this as resolved then as I deployed the Datazilla version in production at https://compare-talos.paas.mozilla.org/ This is going to get replaced soon by PerfHerder comparisons anyways.