Closed Bug 987136 Opened 10 years ago Closed 9 years ago

Replace dromaeo tests with something more useful?

Categories

(Testing :: Talos, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: smaug, Assigned: jmaher)

References

Details

Attachments

(3 files)

Large parts of dromaeo aren't testing anything too useful.
So, would be good if we replaced it with something better.
What that better is, I don't know.
Should we stop measuring some parts of dromaeo and keep only what we want?  Or maybe this bug is about creating a new benchmark that is more useful with current technologies?
Is this about dromaeo dom or dromaeo css?  I ask because we have a common crash on Windows 7 PGO in dromaeo_dom; removing some tests now which are not useful could be a good thing :)
The tests are useful for detecting regressions.

The tests are not useful for measuring progress on improving the DOM implementation.

What is the crash?
Sorry, I should have linked the bug originally.  It is bug 872788.
Ah.  We should fix the broken test, I would think....
I am open to trying out any fixes to dromaeo, adding/removing tests, etc.

A little direction would be nice.
Joel, there's a bunch of probably-relevant info in https://bugs.webkit.org/show_bug.cgi?id=95376#c4 and following.  It's cited in bug 872788, but may be kinda lost in all the tbpl robot spew.  :(
Funny, I was thinking recently about pushing harder to fix or shut off dromaeo on Win7 due to bug 872788. It's pretty absurd that we can tolerate a test suite OOMing that often without any sense of urgency. And we already killed it on XP for the same reason, so the precedent is there.
bz: I have been working on this with no luck.  You can see how frequently we crash:
https://tbpl.mozilla.org/?tree=Try&rev=14720d0262e3

All of the reds are on dromaeo_dom and look to be related to alloc issues or OOM.

There is one block of 5 greens; this is where I turned dromaeo_dom off and ran dromaeo_css, kraken, and v8_7 - they seem to be solid.  We don't have this problem on non-PGO, nor on Linux PGO.

I took some time and looked at the patch on https://bugs.webkit.org/show_bug.cgi?id=95376#c4.  I have applied this for all pushes you see there.  To be truthful, I had a typo at one point, so a couple of reds are related to it hanging and then timing out.

Here are the changes I have made:
https://pastebin.mozilla.org/4948440

I would love to try other changes so we can keep this benchmark open.

:bz, any ideas here?
Flags: needinfo?(bzbarsky)
Talked to Joel on IRC.  The crashes in that try push all seem to be under the dom-modify test, not the jquery tests.  In particular, in the cloneNode test there.  I filed bug 1001565 on that, to localize the noise.
Flags: needinfo?(bzbarsky)
Blocks: 1040081
(In reply to Boris Zbarsky [:bz] from comment #3)
> The tests are useful for detecting regressions.
> 
> The tests are not useful for measuring progress on improving the DOM
> implementation.
> 
> What is the crash?

How is this the case? Is it because they behave like unit tests as far as DOM goes? In which case, they should not be part of performance testing, right?

Also, FWIW and regardless of the DOM relevance, IIRC Vladan mentioned that the JS guys do like this suite for measuring performance.

Maybe the solution would be to split this test into two:

1. Unit test for DOM conformance - moved to mochitest or some other unit test facility.

2. JS performance test.
Flags: needinfo?(bzbarsky)
> Is it because they behave like unit-tests as far as DOM goes?

No, it's because they behave as performance regression tests.

In particular, some of the Dromaeo subtests are useless for measuring performance progress because we more or less completely optimize them out by loop-hoisting things.  But they're useful for catching regressions in the loop-hoisting machinery and other parts of the JIT.  These tests end up dominating the overall score because of how Dromaeo is scored.
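A minimal sketch of the failure mode described above, with an invented benchmark (not an actual Dromaeo subtest): the loop body only reads a value that never changes, so a JIT with loop-invariant code motion can hoist the read out of the loop, and the loop then measures the optimizer rather than the DOM implementation.

```javascript
// Hypothetical illustration, not real Dromaeo code: `node.nodeType` is
// loop-invariant, so an optimizing JIT may compute it once and reuse the
// result, making the loop nearly free regardless of DOM performance.
function attributeReadBenchmark(node, iterations) {
  let last;
  for (let i = 0; i < iterations; i++) {
    last = node.nodeType; // candidate for loop hoisting
  }
  return last;
}

// A plain object stands in for a DOM node so the sketch runs outside a browser.
const fakeNode = { nodeType: 1 };
console.log(attributeReadBenchmark(fakeNode, 1000000)); // 1
```

A regression in the hoisting machinery shows up as this loop suddenly getting much slower, which is exactly the kind of signal the suite still provides.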

> Maybe the solution would be to split this test into two:

It's currently functioning as mostly your #2, with smidgeons of actual DOM performance thrown in.
Flags: needinfo?(bzbarsky)
Depends on: 1022239
In bug 1099591, we yet again find more noise or odd side effects of dromaeo dom.  In fact, recently there has been a string of bugs filed against dromaeo dom which are not related to it.  Dromaeo DOM as it is implemented yields a noisy pattern and goes up and down quite often.

Is there value in continuing to run dromaeo DOM?

In the last 9 months of tracking regressions we have 15 bugs associated with dromaeo DOM:
https://bugzilla.mozilla.org/show_bug.cgi?id=979875 - WONTFIX (10.6)
https://bugzilla.mozilla.org/show_bug.cgi?id=619558 - ggc / partially fixed - many tests affected
https://bugzilla.mozilla.org/show_bug.cgi?id=990140 - WONTFIX (10.6)
https://bugzilla.mozilla.org/show_bug.cgi?id=994692 - NEW (10.6) since April
https://bugzilla.mozilla.org/show_bug.cgi?id=1039611 - NEW (Win*) PGO changes/ since July
https://bugzilla.mozilla.org/show_bug.cgi?id=1042110 - WONTFIX (linux32)
https://bugzilla.mozilla.org/show_bug.cgi?id=1050705 - NEW (win7|8) since July
https://bugzilla.mozilla.org/show_bug.cgi?id=1050260 - NEW (10.8) since July
https://bugzilla.mozilla.org/show_bug.cgi?id=1057135 - FIXED (without investigating by other followup)
https://bugzilla.mozilla.org/show_bug.cgi?id=1086539 - NEW (Win*) since Oct 16
https://bugzilla.mozilla.org/show_bug.cgi?id=1088231 - WONTFIX (linux32)
https://bugzilla.mozilla.org/show_bug.cgi?id=1089741 - WONTFIX (linux32)

So there is no apparent value we have gotten from this test since February.  We get a lot of notifications, but it is probably randomizing us more than it is helping us.

Can we just disable dromaeo DOM across the board?  Do we have a viable replacement?  Am I just making a case with limited data and a limited point of view?
Flags: needinfo?(bzbarsky)
Flags: needinfo?(avihpit)
(In reply to Joel Maher (:jmaher) from comment #13)
> Can we just disable dromaeo DOM across the board?  Do we have a viable
> replacement?
Unfortunately I don't think there is any good performance test suite for DOM. I was hoping Speedometer would become such,
but at least initially it was very limited.

A 365-day view of the Dromaeo DOM test suite can still be somewhat useful to at least see the generic trend and to catch regressions.
I think that since apparently there's some value in this test which is not reflected in other tests, then unless we have something useful to replace it with, bz and smaug would still vote to keep it. Whether or not it's worth the overhead/noise of leaving it running is not something I can assess.

However, maybe someone could identify explicitly where this value is, and maybe we could modify the test to provide only this value while somehow filtering out the noise?

Maybe completely ignoring some sub-tests?

bz, smaug, do you know which sub tests provide meaningful value? Any idea how to crunch the numbers such that we get the value but not the noise and false positives?
Flags: needinfo?(avihpit)
> yields a noisy pattern and goes up and down quite often

Do we know why?

Is it worth switching it from using "new Date" to "performance.now" to see if that helps?

> Is there value in continuing to run dromaeo DOM?

Yes.  It serves as a regression test for various JS issues.  :(

> Do we have a viable replacement?

No.  :(

We can work on such a replacement.  We can try the performance.now() thing.  We can think about only looking at cross-platform dromaeo-dom regressions, since nothing in there should really be platform-dependent, modulo compiler issues.
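The timing change suggested above could look roughly like this (function names are illustrative, not the actual Dromaeo webrunner code): `new Date` yields integer milliseconds from a non-monotonic wall clock, while `performance.now()` is a monotonic timer with sub-millisecond resolution, which should reduce quantization noise on very short subtests.

```javascript
// In browsers `performance` is a global; this line makes the sketch also
// runnable under Node.js.
const performance = globalThis.performance ?? require('perf_hooks').performance;

function timeWithDate(fn) {
  const start = new Date().getTime();
  fn();
  // Integer milliseconds; can jump backwards if the system clock is adjusted.
  return new Date().getTime() - start;
}

function timeWithPerformanceNow(fn) {
  const start = performance.now();
  fn();
  // Fractional milliseconds from a monotonic clock.
  return performance.now() - start;
}
```

For subtest bodies that run well under a millisecond per iteration, the Date-based timer rounds most samples to 0 or 1 ms, which is one plausible source of the multi-modal score distributions.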

> bz, smaug, do you know which sub tests provide meaningful value?

Unfortunately, most of them.

Do we have any data on which subtests (or all of them?) are driving the noise and secular oscillations?
Flags: needinfo?(bzbarsky)
We have almost 2 months of data here showing the subtests:
http://people.mozilla.org/~klahnakoski/talos/Alert-Talos-Dromaeo.html#sampleMax=2014-11-17&sampleMin=2014-08-01&revision=&branch=Mozilla-Inbound&os=linux.Ubuntu+12.04&platform=x86_64

Traverse seems to be the most stable of the 4 dromaeo dom tests we have.  The others have a wide range of values (almost tri-modal).

Keep in mind that with the way our current system works, we drop the highest value (query.html), so we are only looking at 3 values which are then averaged together to produce the final dromaeo dom score.  

As far as I know we have not found a valid dromaeo DOM regression in the last year; quite possibly it is too noisy or sensitive to other changes.  Would it be viable to remove one or more of these tests so we can get some benefit from dromaeo dom?
> we drop the highest value (query.html)

Uh.. why would we do that?  That's totally broken!  It's basically just ignoring the query test, no?  Comparing the "highest" vs not when things have completely different scales is also totally nonsensical.  Can we fix this, please?

> As far as I know we have not found a valid dromaeo DOM regression in the last year

Good, so we're not breaking shit.  That's what this test is meant to ensure. ;)

That said, do we know what the story is with the drop around 9/1 on dom-traverse?  Followed by secular rise at that....

> would it be viable to remove one or more of these tests so we can get some benefit from
> dromaeo dom?

Can you define "benefit"?  To me, knowing that we have not regressed a certain set of code patterns is a benefit...

In any case, it looks like all the tests are reasonably noisy, with dom-traverse and dom-modify having the most secular drift.

Can we start with transitioning the tests to performance.now() and seeing how noise is affected?
The reason for pushing on this is that we get a lot of noise from the alert system.  These are performance tests and should be treated as such, not as unit tests.  If we are getting a false-positive alert every month, that wastes time for all parties involved.  My goal is to make this test useful; I see a few ways to do that:
1) replace the test with something less sensitive to unrelated changes
2) modify the test to work better (i.e. performance.now(), etc.)
3) fix the reporting so that we don't get false positives
4) turn off the alert detection for dromaeo DOM
5) turn off dromaeo DOM completely

#2 is easy to do. #3 is going to get partially done when we switch to geometric mean for values instead of (drop highest value, average the rest) in bug 1077177.

I am not looking for perfection, but I am expecting that we have fewer false positives with this test.
> These are performance tests and should be treated as such, not as unit tests.

Sure.  We're not treating them as unit tests.

> If we are getting a false positive alert every month,

Then we should try to fix that, yes.  Absolutely agreed.

> 1) replace the test with something less sensitive to unrelated changes

This is a large body of work.  But possibly useful, yes.

> 2) modify the test to work better (i.e. performance.now(), etc.)

We should try doing this.

> 3) fix the reporting so that we don't get false positives

Sure.  But are you saying that right now we report the _arithmetic_ mean of totally unrelated Dromaeo subtest scores?  Why???  The dromaeo harness normally reports the geometric mean; why did we start doing something completely different and totally wrong?  :(  The real issue there is not "false positives" but rather "the data reported is garbage", no?
> We should try doing this.

And to be clear, if you point me to the repo where the thing we actually run lives, I'm pretty happy to at least write a patch here.
As a first step, let's get performance.now() implemented.  This will be for both dromaeo dom and dromaeo css.

Related: there was an issue with the graphs we had for the 4 subtests of dromaeo.  If you go to:
http://people.mozilla.org/~klahnakoski/talos/Alert-Talos-Dromaeo.html#sampleMax=2014-11-17&sampleMin=2014-08-01&revision=&branch=Mozilla-Inbound&os=linux.Ubuntu+12.04&platform=x86_64

or:
http://people.mozilla.org/~klahnakoski/talos/Alert-Talos-Dromaeo.html#sampleMax=2014-11-17&sampleMin=2014-08-01&revision=&branch=Mozilla-Inbound&os=win.6.1.7601&platform=x86

you can see the updated graphs where we don't have issues with platforms being mixed up.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8524675 - Flags: review?(wlachance)
Attachment #8524675 - Flags: feedback?(bzbarsky)
To answer the question about drop highest/average vs geometric mean:

Talos (historically and now) takes the list of all tests which are run and applies this formula to it.  We know it is not useful and have been working (at a much lower priority) to switch to a geometric mean for all calculations.

Each of the 4 subtests in dromaeo does its own internal calculations to generate the number for that specific test page.
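For illustration, here are the two aggregation schemes side by side on made-up subtest scores (the numbers are invented, not real Dromaeo results):

```javascript
// Old graph-server behavior: drop the single highest value, then take the
// arithmetic mean of the rest -- which effectively ignores query.html,
// since it always posts the largest score.
function dropHighestThenMean(scores) {
  const sorted = [...scores].sort((a, b) => a - b);
  const kept = sorted.slice(0, -1); // discard the highest score
  return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

// Geometric mean (what the Dromaeo harness itself reports): scale-free,
// so no single large-valued subtest can dominate the summary.
function geometricMean(scores) {
  const logSum = scores.reduce((sum, x) => sum + Math.log(x), 0);
  return Math.exp(logSum / scores.length);
}

// Hypothetical scores for attr / modify / traverse / query:
const scores = [100, 200, 400, 3200];
console.log(dropHighestThenMean(scores)); // ~233.3 -- query never contributes
console.log(geometricMean(scores));       // ~400
```

Under the old scheme a 50% regression in query.html is invisible; under the geometric mean every subtest moves the summary by the same relative amount.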
Comment on attachment 8524675 [details] [diff] [review]
adjust dromaeo webrunner to use performance.now (1.0)

Review of attachment 8524675 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense, though I'm personally skeptical that this will make a gigantic difference.
Attachment #8524675 - Flags: review?(wlachance) → review+
(In reply to Please do not ask for reviews for a bit [:bz] from comment #20)
> ...
> Sure.  But are you saying that right now we report the _arithmetic_ mean of
> totally unrelated Dromaeo subtest scores?  Why???  The dromaeo harness
> normally reports the geometric mean; why did we start doing something
> completely different and totally wrong?  :(  The real issue there is not
> "false positives" but rather "the data reported is garbage", no?

Yes. That's how graph server works. It was designed for uniform sub-tests values (such as page load times from alexa top NN sites), and indeed is incorrect for cases where each sub test measures a different thing and has a different scale.

Bug 1021842 should move graph server (graph view and regression alerts) to geometric mean without dropping the sub-test with the highest value.

The reason it's delayed is because we were supposed to drop graph server (both for graphs and for alerts) and move to DataZilla, which doesn't average at all, and instead just monitors and alerts on each sub-test individually.

But DataZilla has issues with the volumes of desktop performance data, and so now the plan is to integrate graphs and alerts into Treeherder.
Attachment #8524675 - Flags: feedback?(bzbarsky) → feedback+
Blocks: 1101789
Dromaeo dom does find real issues once in a while, but the majority of alerts end up being noise or something unrelated to the patch that first exhibits the changed values.

I would like to adjust our graph server reporting to only report larger regressions (i.e. 10% or more) so we still find issues with dromaeo DOM, and developers and talos sheriffs don't have to be dealing with special cases for certain tests/platforms.

In graph server we have the concept of a percentage_threshold:
http://hg.mozilla.org/graphs/file/b00f49d8a764/server/analysis/analyze_talos.py#l599

Based on information in the repo, this appears to be 2%.  Quite possibly we could have a high_percentage_threshold and set it to 10 (I would be open to lower, but it needs to be >5) along with a list of tests which we use (i.e. high_percentage_tests, and it could be defined as we do reverse tests: http://hg.mozilla.org/graphs/file/b00f49d8a764/server/analysis/analysis.cfg.template#l59).
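A rough sketch of the proposed per-test threshold, written in JavaScript for illustration (the real analyze_talos.py is Python, and the constant and function names here are hypothetical):

```javascript
// Default alert threshold is 2% per the repo; tests on the high-threshold
// list would only alert on regressions of 10% or more.
const DEFAULT_THRESHOLD = 2;  // percent
const HIGH_THRESHOLD = 10;    // percent, proposed for noisy suites
const HIGH_THRESHOLD_TESTS = new Set(['dromaeo_dom']);

function shouldAlert(testName, oldScore, newScore) {
  // Dromaeo scores are runs/second, so a regression is a *drop* in score.
  const pctChange = ((oldScore - newScore) / oldScore) * 100;
  const threshold = HIGH_THRESHOLD_TESTS.has(testName)
    ? HIGH_THRESHOLD
    : DEFAULT_THRESHOLD;
  return pctChange >= threshold;
}
```

With this in place a 5% swing on dromaeo_dom stays silent while a 3% drop on an ordinary test still alerts.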

bz, thoughts on this?
Flags: needinfo?(bzbarsky)
Do we have an idea of what the false-positive regressions look like?  Are they major jumps in one test, or noise across multiple tests?

In an ideal world, I'd want to be notified of large (30%?) regressions on any subtest; if I had that restricting overall regressions to 10% would be fine.  But if we have random 30+% noise in subtests, then the world is not ideal...
Flags: needinfo?(bzbarsky)
This patch is for graph server to ignore dromaeo alerts unless they are >10%.
Attachment #8547591 - Flags: review?(catlee)
Attachment #8547591 - Flags: review?(catlee) → review+
Graph server changes are in tree; we will be rolling this out with the geometric mean changes this week:
http://hg.mozilla.org/graphs/rev/599dc677be37
Attachment #8547739 - Flags: review?(catlee) → review+
Typo fixed:
http://hg.mozilla.org/graphs/rev/3d9bb56cf4bb

this is deployed live.

I am not sure if there is anything left to do in this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED