disable MOZ_GTEST_BENCH tests as we are not looking at the data

RESOLVED FIXED in Firefox 55

Status

RESOLVED FIXED
Product: Testing
Component: General
Opened: 11 months ago
Resolved: 6 months ago

People

(Reporter: jmaher, Assigned: jmaher)

Tracking

Version: Trunk
Target Milestone: mozilla55
Points: ---

Firefox Tracking Flags

(firefox55 fixed)

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

11 months ago
A while back we created a mechanism for microbenchmarks in the platform.  This was built as a way for developers to collect small performance metrics in unit tests and upload the data to perfherder.  In fact we have a bunch of alerts:
https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&hideTo=1&page=1

The problem is these alerts have not been addressed in months, and we have no confidence anyone is looking at the data.

You can see where we define the tests:
https://dxr.mozilla.org/mozilla-central/search?q=MOZ_GTEST_BENCH
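
For reference, a test is declared roughly like this (a sketch based on the in-tree macro; check the search link above for the exact header path and signature):

#include "mozilla/gtest/MozGTestBench.h"

// MOZ_GTEST_BENCH wraps an ordinary gtest TEST(): it times the callable
// and emits the result in the PERFHERDER_DATA log format that gets
// uploaded to perfherder.
MOZ_GTEST_BENCH(ExampleBench, TinyOperation, []() {
  // The body runs inside the timed region, so keep setup out of the
  // lambda where possible. DoTinyOperation() is hypothetical, for
  // illustration only.
  DoTinyOperation();
});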

In an ideal state this will be resolved as WONTFIX and we will find someone who will track the results regularly and make sure they are useful and actionable.

Another solution is to run the tests as unit tests and disable the perfherder uploading if we are not tracking the values.

Comment 1

11 months ago
We were aware of the URL regression from other reports, and it was fixed. Is it possible to set up email notifications for bench tests?
(Assignee)

Comment 2

11 months ago
We would need to have email addresses defined for each test; probably not impossible. We do have a dashboard for alerts:
https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&page=1

I know in the early days of this the GfxBench results were so noisy they were not providing value. I see that for TestStandardURL the results look stable.

Comment 3

11 months ago
(In reply to Joel Maher ( :jmaher) from comment #2)
> We would need to have email addresses defined for each test; probably not
> impossible. We do have a dashboard for alerts:
> https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&page=1
> 
> I know in the early days of this the GfxBench results were so noisy they
> were not providing value. I see that for TestStandardURL the results look
> stable.

So as we discussed on IRC, our experience with email alerts has been pretty negative (i.e. no one pays attention to them). IMO, we should find someone who is actively interested in sheriffing/triaging this data, or just turn off the tests.

Comment 4

11 months ago
I had no idea these were around (OK, at some level, I'm sure I knew :)

Why can't these be handled the same way that Talos regressions are handled?  Open a bug; back out or explain.

Comment 5

11 months ago
From my experience monitoring these, they are incredibly noisy for seemingly no reason. They would, for instance, attribute large regressions to innocent changesets in the build system, which likely just changed offsets in the binary. I was hoping that in the long run they might provide useful trends for micro-operations, particularly for something like RGBX blend to RGBA, but as-is they were a lot of work to sheriff.

I'm sure with some effort they could probably be turned into something more useful than they are now. But I can see why, ATM, they're not as useful as I intended.

Comment 6

11 months ago
(In reply to Benoit Girard (:BenWa) from comment #5)
> From my experience monitoring these, they are incredibly noisy for
> seemingly no reason. They would, for instance, attribute large regressions
> to innocent changesets in the build system, which likely just changed
> offsets in the binary.

Is this because the operations they were measuring were tiny enough that small amounts of noise drown out the signal? Rust/Cargo |bench| has a pretty nice mechanism for running small operations lots of times until the numbers stabilize. Shing has hooked these up to Servo's perfherder, so he might be able to share his experience on how noisy they are.
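
For concreteness, that stabilizing idea looks roughly like this in C++ (a sketch of the general technique, not code from either tree):

#include <chrono>
#include <cmath>
#include <cstdint>

// Double the iteration count until two consecutive per-op estimates
// agree; amortizing many iterations drowns out timer overhead and
// scheduling noise for very small operations.
template <typename Fn>
double EstimateNsPerOp(Fn&& aOp) {
  using Clock = std::chrono::steady_clock;
  double prev = 0.0;
  for (uint64_t iters = 1; iters < (1ULL << 30); iters *= 2) {
    auto start = Clock::now();
    for (uint64_t i = 0; i < iters; ++i) {
      aOp();
    }
    double nsPerOp =
        std::chrono::duration<double, std::nano>(Clock::now() - start).count() / iters;
    if (prev > 0.0 && std::fabs(nsPerOp - prev) / prev < 0.02) {
      return nsPerOp;  // within 2% of the last estimate: call it stable
    }
    prev = nsPerOp;
  }
  return prev;  // give up eventually and report the last estimate
}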

In general, there are two (mostly orthogonal) things about this setup that I like and would like to find a way to preserve and leverage more:
(A) The ability to measure specific operations in an isolated environment, without the noise of integration testing.
(B) The ability for developers to quickly and easily add performance tests (ideally as easily as it is to add other kinds of tests).

For Quantum CSS we really need something that lets us do (B) - we basically want to be able to add a simple HTML file (like a crashtest) and track regressions in the page load time. This would be really helpful to avoid regressing tricky style system optimizations and making certain workloads 10x or 100x slower.

Comment 7

11 months ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #6)
> Is this because the operations they were measuring were tiny enough that
> small amounts of noise drown out the signal?

Yes, and my theory is that any irrelevant change, like cache behavior (boundaries, aliasing) or things of that sort, can cause a large score impact, but I never spent any time investigating when the patches were clearly unrelated to the microbenchmark code.

> Rust/Cargo |bench| has a pretty nice
> mechanism for running small operations lots of times until the numbers
> stabilize. Shing has hooked these up to Servo's perfherder, so he might be
> able to share his experience on how noisy they are.
> 

That's interesting and sounds useful. I do think it's a solvable problem if someone has the cycles to work on it. It didn't help that I added a few tests and then left this thing aside. I'm sure if the problems were fixed and it was marketed a bit, it would be vastly more useful.

But as-is, if no one touches it, I agree with :jmaher that it might not be worth running.
(Assignee)

Comment 8

8 months ago
:rwood, can you follow up here and determine if we should adjust the tests and sheriff them, or if we should turn them off?  Possibly there are good opportunities for tests to exist, but if they are sheriffed, there should be documentation as to what they do and who to pester about them.
Flags: needinfo?(rwood)
Comment hidden (mozreview-request)

Comment 10

8 months ago
On the assumption that these aren't useful (I feel like the discussion here was pretty conclusive), I created a patch which (I think) should remove them from the tree. Ofc we can always bring them back in the future.
Comment hidden (mozreview-request)

Comment 12

8 months ago
Per IRC discussion, we very coincidentally just found a really nice use for platform microbenchmarks.

Specifically, a large part of the time that Talos spends in the style system is parsing stylesheets. This is totally decoupled from the DOM, and is just a single function (Servo_StyleSheet_FromUTF8Bytes). This would be a really nice thing to hook up with a test stylesheet or two to isolate regressions in parser performance.
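
Something like the following, hypothetically (ParseTestSheet and kTestSheet are made-up stand-ins for illustration; the real entry point is Servo_StyleSheet_FromUTF8Bytes, whose actual signature lives in the tree):

// Hypothetical sketch: time stylesheet parsing in isolation.
static const char kTestSheet[] =
    ".a { color: red; }  /* ...a few hundred more rules... */";

MOZ_GTEST_BENCH(Stylo, ParseTestStylesheet, []() {
  // Everything timed here is parser work, decoupled from the DOM.
  ParseTestSheet(kTestSheet, sizeof(kTestSheet) - 1);
});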

wlach, can you help Simon get this spun up and then we can see how it goes?
(Assignee)

Comment 13

8 months ago
Presumably we will need to run this on physical hardware.  If these are GTEST, then we need to run a separate GTEST job on hardware.  This can be done if we treat a specific job as hardware-only (it will need to be run via BBB on taskcluster).

Comment 14

8 months ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #12)

> wlach, can you help Simon get this spun up and then we can see how it goes?

I don't really understand the microbenchmarking code terribly well, but from what I can tell it's just a gtest variant.

You can probably model your approach after the existing tests, which were added here:

https://hg.mozilla.org/mozilla-central/rev/396639988f517bfc4e8a8a6b8afd88e3344a7bf7

Comment 15

8 months ago
(In reply to Joel Maher ( :jmaher) from comment #13)
> Presumably we will need to run this on physical hardware. If these are
> GTEST, then we need to run a separate GTEST job on hardware. This can be
> done if we treat a specific job as hardware-only (it will need to be run
> via BBB on taskcluster).

I am not sure if this is true. Experience suggests the main determinant of whether a performance test is useful is whether or not it has a Gaussian (or close to Gaussian) distribution. So long as we can get that on the VM (and it's at least somewhat representative of what we'd see in the field), we might be OK.

Comment 16

8 months ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #10)
> On the assumption that these aren't useful (I feel like the discussion here
> was pretty conclusive), I created a patch which (I think) should remove them
> from the tree. Ofc we can always bring them back in the future.

I agree; it's been 4 months since the initial bug was filed, so it looks like there's not much interest (at least in the existing suite).
Flags: needinfo?(rwood)
(Assignee)

Comment 17

6 months ago
I would like to finish this up. :rwood, can you get an updated patch for review?
Flags: needinfo?(rwood)

Comment 18

6 months ago
Wait, hang on. My recollection from the call we had 2 months ago was that we decided to keep these, which is why Simon used this when implementing bug 1355028.
(Assignee)

Comment 19

6 months ago
OK, I didn't see that in this bug. Can we turn off the benchmarks that we cannot get value from, then?  We have hundreds of alerts per month, which is just noise.

Comment 20

6 months ago
(In reply to Joel Maher ( :jmaher) from comment #19)
> OK, I didn't see that in this bug. Can we turn off the benchmarks that we
> cannot get value from, then? We have hundreds of alerts per month, which is
> just noise.

In general, I have less of an objection to turning off other people's benchmarks, and mostly care about ours. :-)

That said, it's probably worth preserving the ones that turn out to be stable. Our benchmarks (the Parser ones) appear to be pretty stable, so presumably some of the other ones are too.

Comment 21

6 months ago
Glanced at the alerts (https://treeherder.mozilla.org/perf.html#/alerts?status=-1&framework=6&hideDwnToInv=1&page=1) a bit. My unsolicited advice is:

* Turn off running these (or at least alerting) on all platforms except one (maybe linux64). linux32 in particular seems noisy and not particularly useful.
* Remove the original compositor set of tests that :BenWa added, since even he said they weren't useful (this is covered by the patch attached to this bug, minus the bits that turn off microbenchmarks altogether)

Going forward I would suggest:

* Have a strict requirement that tests in this suite report numbers with a normal distribution -- tests that produce multimodal distributions should either be turned off or at least made not to alert (this policy would have caught the compositor tests I mention above, which seem responsible for a large % of the noise)

The combination of these things should make sheriffing these tests manageable, I think. Remember that while more coverage/data might seem helpful, we are at the point where the limiting factor is people to look at it. Alerting/sheriffing on more data comes at the cost of a timely response to more useful information (e.g. Talos).
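
To make the normal-distribution requirement concrete, a screen over a test's replicates could be as simple as a moment check (a sketch with illustrative thresholds, not existing Perfherder infrastructure; multimodal samples show up as strongly negative excess kurtosis, skewed ones as large skewness):

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: accept a test's numbers only if their skewness and excess
// kurtosis are near what a normal distribution would produce.
bool LooksRoughlyNormal(const std::vector<double>& aSamples) {
  const size_t n = aSamples.size();
  if (n < 30) {
    return false;  // too few replicates to judge the distribution
  }
  double mean = 0.0;
  for (double x : aSamples) mean += x;
  mean /= n;
  double m2 = 0.0, m3 = 0.0, m4 = 0.0;
  for (double x : aSamples) {
    const double d = x - mean;
    m2 += d * d;
    m3 += d * d * d;
    m4 += d * d * d * d;
  }
  m2 /= n; m3 /= n; m4 /= n;
  const double skewness = m3 / std::pow(m2, 1.5);
  const double excessKurtosis = m4 / (m2 * m2) - 3.0;
  // Both are near 0 for a normal sample; the 1.0 cutoffs are
  // illustrative, not tuned.
  return std::fabs(skewness) < 1.0 && std::fabs(excessKurtosis) < 1.0;
}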
(Assignee)

Comment 22

6 months ago
I am happy to keep the new tests and happy to get rid of the old graphics tests.
(Assignee)

Comment 23

6 months ago
Created attachment 8873410 [details] [diff] [review]
disable gfx related platform_microbenchmarks

This removes just the gfx-related microbenchmarks; we still have a few more:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=69cc32d81abec51f1c90827b8ca4221c6b4da3fc

I think this will greatly reduce our alerts; we should then analyze the data for the rest of the tests and determine which provide value.  Possibly we keep this to Linux only, but it is nice to compare against other platforms.
Assignee: nobody → jmaher
Attachment #8854986 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Flags: needinfo?(rwood)
Attachment #8873410 - Flags: review?(wlachance)

Comment 24

6 months ago
Comment on attachment 8873410 [details] [diff] [review]
disable gfx related platform_microbenchmarks

Review of attachment 8873410 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok to me, though I really think we should also stop running these on linux32 at least.
Attachment #8873410 - Flags: review?(wlachance) → review+

Comment 25

6 months ago
Incidentally, while looking through the old microbenchmark alerts, I found what seems to be a real regression in the string microbenchmarks:

https://bugzilla.mozilla.org/show_bug.cgi?id=1361724#c6

This underlines to me the need to figure out what subset of these we can run to get an actual signal out of them. When there's just a huge pile of invalid data, it's very tempting to just mass-mark it invalid/wontfix.
(Assignee)

Comment 26

6 months ago
I am not sure how to avoid running this on linux32; do you think the gtest itself shouldn't be run on linux32?

I do agree; I regularly flush out alerts here. Until we have confidence that incoming alerts are at least 80% useful, they all get trashed.

Comment 27

6 months ago
(In reply to Joel Maher ( :jmaher) from comment #26)
> I am not sure how to avoid running this on linux32; do you think the gtest
> itself shouldn't be run on linux32?
> 
> I do agree; I regularly flush out alerts here. Until we have confidence
> that incoming alerts are at least 80% useful, they all get trashed.

The easiest thing I can think of would be to have reporting to perfherder controlled by an environment variable that is only set on the platforms we want to report on (or vice versa). I don't think that should be all that difficult.
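
For instance (a sketch; MOZ_GTEST_BENCH_REPORT is a made-up variable name, and the real check would live wherever the PERFHERDER_DATA line is emitted):

#include <cstdio>
#include <cstdlib>

// Only upload benchmark numbers when the harness opts in via the
// environment, so noisy platforms (e.g. linux32) can keep running the
// tests as plain correctness tests without generating alerts.
static bool ShouldReportToPerfherder() {
  const char* env = getenv("MOZ_GTEST_BENCH_REPORT");
  return env && env[0] != '\0';
}

static void MaybeReport(const char* aJson) {
  if (ShouldReportToPerfherder()) {
    printf("PERFHERDER_DATA: %s\n", aJson);
  }
}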

Comment 28

6 months ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d3633639df8
disable MOZ_GTEST_BENCH tests as we are not looking at the data. r=wlach
(Assignee)

Updated

6 months ago
Depends on: 1369807

Comment 29

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8d3633639df8
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55