disable MOZ_GTEST_BENCH tests as we are not looking at the data

RESOLVED FIXED in Firefox 55

Status

RESOLVED FIXED
Product: Testing
Component: General
Opened: 11 months ago
Resolved: 6 months ago

People

(Reporter: jmaher, Assigned: jmaher)

Tracking

Version: Trunk
Target Milestone: mozilla55
Points: ---

Firefox Tracking Flags

(firefox55 fixed)

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

11 months ago
A while back we created a mechanism for microbenchmarks in the platform.  This was built as a way for developers to collect small performance metrics in unit tests and upload the data to perfherder.  In fact we have a bunch of alerts:
https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&hideTo=1&page=1

The problem is these alerts have not been addressed in months, and we have no confidence anyone is looking at the data.

You can see where we define the tests:
https://dxr.mozilla.org/mozilla-central/search?q=MOZ_GTEST_BENCH
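
For reference, a test is declared roughly like this (a sketch based on the in-tree macro; check the search link above for the exact header path and signature):

#include "mozilla/gtest/MozGTestBench.h"

// MOZ_GTEST_BENCH wraps an ordinary gtest TEST(): it times the callable
// and emits the result in the PERFHERDER_DATA log format that gets
// uploaded to perfherder.
MOZ_GTEST_BENCH(ExampleBench, TinyOperation, []() {
  // The body runs inside the timed region, so keep setup out of the
  // lambda where possible. DoTinyOperation() is hypothetical, for
  // illustration only.
  DoTinyOperation();
});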

In an ideal state this will be resolved as WONTFIX and we will find someone who will track the results regularly and make sure they are useful and actionable.

Another solution is to run the tests as unit tests and disable the perfherder uploading if we are not tracking the values.

Comment 1

11 months ago
We were aware of the URL regression from other reports, and it was fixed. Is it possible to set up email notifications for bench tests?
(Assignee)

Comment 2

11 months ago
We would need to have email addresses defined for each test; probably not impossible. We do have a dashboard for alerts:
https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&page=1

I know in the early days of this the GfxBench results were so noisy they were not providing value. I see that for TestStandardURL the results look stable.

Comment 3

11 months ago
(In reply to Joel Maher ( :jmaher) from comment #2)
> We would need to have email addresses defined for each test; probably not
> impossible. We do have a dashboard for alerts:
> https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=6&page=1
> 
> I know in the early days of this the GfxBench results were so noisy they
> were not providing value. I see that for TestStandardURL the results look
> stable.

So as we discussed on IRC, our experience with email alerts has been pretty negative (i.e. no one pays attention to them). IMO, we should find someone who is actively interested in sheriffing/triaging this data, or just turn off the tests.

Comment 4

11 months ago
I had no idea these were around (OK, at some level, I'm sure I knew :)

Why can't these be handled the same way that Talos regressions are handled?  Open a bug; back out or explain.

Comment 5

11 months ago
From my experience monitoring these, they are incredibly noisy for seemingly no reason. They would, for instance, attribute large regressions to innocent changesets in the build system, which likely just changed offsets in the binary. I was hoping that in the long run they might provide useful trends for micro-operations, particularly for something like RGBX blend to RGBA, but as-is they were a lot of work to sheriff.

I'm sure with some effort they could probably be turned into something more useful than they are now. But I can see why, ATM, they're not as useful as I intended.

Comment 6

11 months ago
(In reply to Benoit Girard (:BenWa) from comment #5)
> From my experience monitoring these, they are incredibly noisy for
> seemingly no reason. They would, for instance, attribute large regressions
> to innocent changesets in the build system, which likely just changed
> offsets in the binary.

Is this because the operations they were measuring were tiny enough that small amounts of noise drown out the signal? Rust/Cargo |bench| has a pretty nice mechanism for running small operations lots of times until the numbers stabilize. Shing has hooked these up to Servo's perfherder, so he might be able to share his experience on how noisy they are.
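
For concreteness, that stabilizing idea looks roughly like this in C++ (a sketch of the general technique, not code from either tree):

#include <chrono>
#include <cmath>
#include <cstdint>

// Double the iteration count until two consecutive per-op estimates
// agree; amortizing many iterations drowns out timer overhead and
// scheduling noise for very small operations.
template <typename Fn>
double EstimateNsPerOp(Fn&& aOp) {
  using Clock = std::chrono::steady_clock;
  double prev = 0.0;
  for (uint64_t iters = 1; iters < (1ULL << 30); iters *= 2) {
    auto start = Clock::now();
    for (uint64_t i = 0; i < iters; ++i) {
      aOp();
    }
    double nsPerOp =
        std::chrono::duration<double, std::nano>(Clock::now() - start).count() / iters;
    if (prev > 0.0 && std::fabs(nsPerOp - prev) / prev < 0.02) {
      return nsPerOp;  // within 2% of the last estimate: call it stable
    }
    prev = nsPerOp;
  }
  return prev;  // give up eventually and report the last estimate
}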

In general, there are two (mostly orthogonal) things about this setup that I like and would like to find a way to preserve and leverage more:
(A) The ability to measure specific operations in an isolated environment, without the noise of integration testing.
(B) The ability for developers to quickly and easily add performance tests (ideally as easily as it is to add other kinds of tests).

For Quantum CSS we really need something that lets us do (B) - we basically want to be able to add a simple HTML file (like a crashtest) and track regressions in the page load time. This would be really helpful to avoid regressing tricky style system optimizations and making certain workloads 10x or 100x slower.

Comment 7

11 months ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #6)
> Is this because the operations they were measuring were tiny enough that
> small amounts of noise drown out the signal?

Yes, and my theory is that any irrelevant change, like cache behavior (boundaries, aliasing) or things of that sort, can cause a large score impact, but I never spent any time investigating when the patches were clearly unrelated to the microbenchmark code.

> Rust/Cargo |bench| has a pretty nice
> mechanism for running small operations lots of times until the numbers
> stabilize. Shing has hooked these up to Servo's perfherder, so he might be
> able to share his experience on how noisy they are.
> 

That's interesting and sounds useful. I do think it's a solvable problem if someone has the cycles to work on it. It didn't help that I added a few tests and then left this thing aside. I'm sure if the problems were fixed and it was marketed a bit, it would be vastly more useful.

But as-is, if no one touches it, I agree with :jmaher that it might not be worth running.
(Assignee)

Comment 8

8 months ago
:rwood, can you follow up here and determine if we should adjust the tests and sheriff them, or if we should turn them off?  Possibly there are good opportunities for tests to exist, but if they are sheriffed, there should be documentation as to what they do and who to pester about them.
Flags: needinfo?(rwood)
Comment hidden (mozreview-request)

Comment 10

8 months ago
On the assumption that these aren't useful (I feel like the discussion here was pretty conclusive), I created a patch which (I think) should remove them from the tree. Ofc we can always bring them back in the future.
Comment hidden (mozreview-request)

Comment 12

8 months ago
Per IRC discussion, we very coincidentally just found a really nice use for platform microbenchmarks.

Specifically, a large part of the time that Talos spends in the style system is parsing stylesheets. This is totally decoupled from the DOM, and is just a single function (Servo_StyleSheet_FromUTF8Bytes). This would be a really nice thing to hook up with a test stylesheet or two to isolate regressions in parser performance.
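
Something like the following, hypothetically (ParseTestSheet and kTestSheet are made-up stand-ins for illustration; the real entry point is Servo_StyleSheet_FromUTF8Bytes, whose actual signature lives in the tree):

// Hypothetical sketch: time stylesheet parsing in isolation.
static const char kTestSheet[] =
    ".a { color: red; }  /* ...a few hundred more rules... */";

MOZ_GTEST_BENCH(Stylo, ParseTestStylesheet, []() {
  // Everything timed here is parser work, decoupled from the DOM.
  ParseTestSheet(kTestSheet, sizeof(kTestSheet) - 1);
});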

wlach, can you help Simon get this spun up and then we can see how it goes?
(Assignee)

Comment 13

8 months ago
Presumably we will need to run this on physical hardware.  If these are GTEST, then we need to run a separate GTEST job on hardware.  This can be done if we treat a specific job as hardware-only (it will need to be run via BBB on taskcluster).

Comment 14

8 months ago
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #12)

> wlach, can you help Simon get this spun up and then we can see how it goes?

I don't really understand the microbenchmarking code terribly well, but from what I can tell it's just a gtest variant.

You can probably model your approach after the existing tests, which were added here:

https://hg.mozilla.org/mozilla-central/rev/396639988f517bfc4e8a8a6b8afd88e3344a7bf7

Comment 15

8 months ago
(In reply to Joel Maher ( :jmaher) from comment #13)
> Presumably we will need to run this on physical hardware. If these are
> GTEST, then we need to run a separate GTEST job on hardware. This can be
> done if we treat a specific job as hardware-only (it will need to be run
> via BBB on taskcluster).

I am not sure if this is true. Experience suggests the main determinant of whether a performance test is useful is whether or not it has a Gaussian (or close to Gaussian) distribution. So long as we can get that on the VM (and it's at least somewhat representative of what we'd see in the field), we might be OK.

Comment 16

8 months ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #10)
> On the assumption that these aren't useful (I feel like the discussion here
> was pretty conclusive), I created a patch which (I think) should remove them
> from the tree. Ofc we can always bring them back in the future.

I agree; it's been 4 months since the initial bug was filed, so it looks like there's not much interest (at least in the existing suite).
Flags: needinfo?(rwood)
(Assignee)

Comment 17

6 months ago
I would like to finish this up. :rwood, can you get an updated patch for review?
Flags: needinfo?(rwood)

Comment 18

6 months ago
Wait, hang on. My recollection from the call we had 2 months ago was that we decided to keep these, which is why Simon used this when implementing bug 1355028.
(Assignee)

Comment 19

6 months ago
OK, I didn't see that in this bug. Can we turn off the benchmarks that we cannot get value from, then?  We have hundreds of alerts per month, which is just noise.

Comment 20

6 months ago
(In reply to Joel Maher ( :jmaher) from comment #19)
> OK, I didn't see that in this bug. Can we turn off the benchmarks that we
> cannot get value from, then? We have hundreds of alerts per month, which is
> just noise.

In general, I have less of an objection to turning off other people's benchmarks, and mostly care about ours. :-)

That said, it's probably worth preserving the ones that turn out to be stable. Our benchmarks (the Parser ones) appear to be pretty stable, so presumably some of the other ones are too.

Comment 21

6 months ago
Glanced at the alerts (https://treeherder.mozilla.org/perf.html#/alerts?status=-1&framework=6&hideDwnToInv=1&page=1) a bit. My unsolicited advice is:

* Turn off running these (or at least alerting) on all platforms except one (maybe linux64). linux32 in particular seems noisy and not particularly useful.
* Remove the original compositor set of tests that :BenWa added, since even he said they weren't useful (this is covered by the patch attached to this bug, minus the bits that turn off microbenchmarks altogether)

Going forward I would suggest:

* Have a strict requirement that tests in this suite report numbers with a normal distribution -- tests that produce multimodal distributions should either be turned off or at least made not to alert (this policy would have caught the compositor tests I mention above, which seem responsible for a large % of the noise)

The combination of these things should make sheriffing these tests manageable, I think. Remember that while more coverage/data might seem helpful, we are at the point where the limiting factor is people to look at it. Alerting/sheriffing on more data comes at the cost of a timely response to more useful information (e.g. Talos).
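
To make the normal-distribution requirement concrete, a screen over a test's replicates could be as simple as a moment check (a sketch with illustrative thresholds, not existing Perfherder infrastructure; multimodal samples show up as strongly negative excess kurtosis, skewed ones as large skewness):

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: accept a test's numbers only if their skewness and excess
// kurtosis are near what a normal distribution would produce.
bool LooksRoughlyNormal(const std::vector<double>& aSamples) {
  const size_t n = aSamples.size();
  if (n < 30) {
    return false;  // too few replicates to judge the distribution
  }
  double mean = 0.0;
  for (double x : aSamples) mean += x;
  mean /= n;
  double m2 = 0.0, m3 = 0.0, m4 = 0.0;
  for (double x : aSamples) {
    const double d = x - mean;
    m2 += d * d;
    m3 += d * d * d;
    m4 += d * d * d * d;
  }
  m2 /= n; m3 /= n; m4 /= n;
  const double skewness = m3 / std::pow(m2, 1.5);
  const double excessKurtosis = m4 / (m2 * m2) - 3.0;
  // Both are near 0 for a normal sample; the 1.0 cutoffs are
  // illustrative, not tuned.
  return std::fabs(skewness) < 1.0 && std::fabs(excessKurtosis) < 1.0;
}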
(Assignee)

Comment 22

6 months ago
I am happy to keep the new tests and happy to get rid of the old graphics tests.
(Assignee)

Comment 23

6 months ago
Created attachment 8873410 [details] [diff] [review]
disable gfx related platform_microbenchmarks

This removes just the gfx-related microbenchmarks; we still have a few more:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=69cc32d81abec51f1c90827b8ca4221c6b4da3fc

I think this will greatly reduce our alerts; we should then analyze the data for the rest of the tests and determine which provide value.  Possibly we keep this to Linux only, but it is nice to compare against other platforms.
Assignee: nobody → jmaher
Attachment #8854986 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Flags: needinfo?(rwood)
Attachment #8873410 - Flags: review?(wlachance)

Comment 24

6 months ago
Comment on attachment 8873410 [details] [diff] [review]
disable gfx related platform_microbenchmarks

Review of attachment 8873410 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok to me, though I really think we should also stop running these on linux32 at least.
Attachment #8873410 - Flags: review?(wlachance) → review+

Comment 25

6 months ago
Incidentally, while looking through the old microbenchmark alerts, I found what seems to be a real regression in the string microbenchmarks:

https://bugzilla.mozilla.org/show_bug.cgi?id=1361724#c6

This underlines to me the need to figure out what subset of these we can run to get an actual signal out of them. When there's just a huge pile of invalid data, it's very tempting to just mass-mark it invalid/wontfix.
(Assignee)

Comment 26

6 months ago
I am not sure how to avoid running this on linux32; do you think the gtest itself shouldn't be run on linux32?

I do agree; I regularly flush out alerts here. Until we have confidence that incoming alerts are at least 80% useful, they all get trashed.

Comment 27

6 months ago
(In reply to Joel Maher ( :jmaher) from comment #26)
> I am not sure how to avoid running this on linux32; do you think the gtest
> itself shouldn't be run on linux32?
> 
> I do agree; I regularly flush out alerts here. Until we have confidence
> that incoming alerts are at least 80% useful, they all get trashed.

The easiest thing I can think of would be to have reporting to perfherder controlled by an environment variable that is only set on the platforms we want to report on (or vice versa). I don't think that should be all that difficult.
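
For instance (a sketch; MOZ_GTEST_BENCH_REPORT is a made-up variable name, and the real check would live wherever the PERFHERDER_DATA line is emitted):

#include <cstdio>
#include <cstdlib>

// Only upload benchmark numbers when the harness opts in via the
// environment, so noisy platforms (e.g. linux32) can keep running the
// tests as plain correctness tests without generating alerts.
static bool ShouldReportToPerfherder() {
  const char* env = getenv("MOZ_GTEST_BENCH_REPORT");
  return env && env[0] != '\0';
}

static void MaybeReport(const char* aJson) {
  if (ShouldReportToPerfherder()) {
    printf("PERFHERDER_DATA: %s\n", aJson);
  }
}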

Comment 28

6 months ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d3633639df8
disable MOZ_GTEST_BENCH tests as we are not looking at the data. r=wlach
(Assignee)

Updated

6 months ago
Depends on: 1369807

Comment 29

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8d3633639df8
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55