Closed Bug 1613956 Opened 5 years ago Closed 4 years ago

address noisy build metrics to reduce workload for perf sheriffs

Categories

(Firefox Build System :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Unassigned)

References

(Depends on 1 open bug)

Details

I had a meeting with Dave Hunt and his team today regarding build metric sheriffing.
tl;dr: The build metrics that the perf sheriffing team looks at are very noisy, and the bugs they open for the build team have a very low fix rate. The build size and sccache hit rate in particular are very noisy.
https://treeherder.mozilla.org/perf.html#/health?framework=2
Can we reduce the alert thresholds or make them more meaningful so that the team can sheriff them more effectively? Also, in the case of build times, they mentioned that it's difficult to determine the root cause of a build time alert, given that builds run on a number of different AWS instance types, which affects how long a build takes to run.

Notes from the meeting: https://docs.google.com/document/d/1zP4fYxbnChSEoNcswWLMj3SD4KdgfziAq4V1y-GXLK8/edit

A place to start would be to raise the threshold for alerts on the sccache hit rate, which seems to account for most of the nuisance alerts. I think this is just a matter of applying the fix from bug 1358098 to the hit rate.
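As a rough illustration of what raising the threshold amounts to, here is a minimal sketch of relative-change filtering. The 10% figure, the function name, and the standalone shape are all illustrative assumptions, not Perfherder's actual alert configuration:

```python
# Hypothetical sketch: suppress hit-rate alerts whose relative change is
# below a raised threshold. The 10% value is an assumption for
# illustration, not the threshold Perfherder actually uses.
HIT_RATE_ALERT_THRESHOLD = 0.10  # require a >=10% relative change

def should_alert(previous_rate, new_rate, threshold=HIT_RATE_ALERT_THRESHOLD):
    """Return True only if the hit rate moved by at least `threshold`."""
    if previous_rate == 0:
        return new_rate > 0
    relative_change = abs(new_rate - previous_rate) / previous_rate
    return relative_change >= threshold
```

Under a scheme like this, a small wobble (say 0.85 to 0.82, about 3.5%) no longer fires an alert, while a large regression (0.85 to 0.60, about 29%) still does.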

With respect to build time noise/instance type issues, I have burned a lot of time trying to track these down myself and completely agree it's an issue. I think a more complete solution here would be to implement a way to automatically aggregate/differentiate across instance types in Perfherder; I'm not sure how much work that would be. Related to this, some way to retrigger/backfill against a specific instance type would be very helpful.

Short of that we could move to mostly looking at the "plain" builds (builds with no automation steps or sccache) and ensure they only run on a single instance type to increase our confidence in the results. We still care about other build times, but the timings are already very noisy due to sccache, so I don't know how much we should try to depend on them.

Finally, if we can't figure out how to improve the tooling to make these amenable to sheriffs, there are still important regressions that get detected and fixed here (bug 1605110 comes to mind). I might suggest we move to a workflow where sheriffs file build metrics regressions and flag a build peer to investigate rather than attempting to find a specific regressing change. Most regressions in build time are going to end up needing to be assessed by a build peer anyway.

Depends on: 1614485
Type: task → enhancement
Priority: -- → P2

Glob, Chris is no longer working on this. Is anyone else following up on these problems? We've seen that the sheriffs are having trouble tracking build time regressions because of the amount of noise in build time metrics.

Flags: needinfo?(glob)

This will be raised at the next Build Peers meeting.

Flags: needinfo?(glob) → needinfo?(rstewart)

I brought this up at the build peers meeting on Tuesday. Here were the main takeaways from that conversation:

  • sccache inherently introduces noise into build time metrics when it is working correctly (in fact, if the metric is NOT noisy over an extended period of time, that's probably good evidence that sccache is not working correctly). Therefore build time regressions, especially where we intend to find a single regressing push, should primarily be reported for plain builds where sccache isn't a factor. This was pointed out in comment 1, but I don't see anything suggesting it was ever done.

  • There's a strong feeling that we should move plain builds over to running on a single instance type to keep that metric as stable as possible as well. Again, mentioned in comment 1, but not implemented yet to my knowledge.

  • The non-plain build time metrics are still meaningful inasmuch as they can point out trends over time, but the actual difference in build times between any pair of subsequent builds, or even across a small set of builds, is more or less meaningless due to the noise mentioned. If we wanted automated insights into this data, we would want something a little more holistic, like tracking average build times over a period of time and triggering when there's a sudden jump in the average or minimum; I don't know if infra is equipped to let us define this type of alert or if that would be a feature request. Alternatively, this sort of regression is extremely easy to spot by looking at a graph, so we could migrate to a less automated, more manual process. Either way this wouldn't allow you to automatically pinpoint a regressing change, which we're comfortable with (we have enough expertise to diagnose that sort of thing once the existence of a regression is pointed out).
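To make the "jump in the average" idea concrete, here is a minimal sketch of windowed comparison over a series of build times. This is not existing Perfherder functionality; the window size, the 15% jump threshold, and the function name are all assumptions for illustration:

```python
# Illustrative sketch: flag a sustained build time regression by comparing
# the mean of the most recent `window` builds against the mean of the
# `window` builds immediately before them, rather than alerting on any
# single noisy build-to-build difference.
from statistics import mean

def sustained_regression(times, window=12, jump=0.15):
    """Return True if the trailing window's mean build time exceeds the
    preceding window's mean by more than `jump` (15% by default)."""
    if len(times) < 2 * window:
        return False  # not enough history to compare two full windows
    before = mean(times[-2 * window:-window])
    after = mean(times[-window:])
    return (after - before) / before > jump
```

A step from ~100-minute builds to ~120-minute builds would trip this, while flat-but-noisy data would not, which is the property single-build alerting lacks.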

  • On our end, we suspect that if these alerts are such a persistent concern for the sheriffs and a lot of time is being wasted, then something about the process should probably be refined, because a build peer is usually going to want to evaluate these regressions anyway, and we don't expect much from the sheriffs besides the basics (like the exact alert that was triggered, a graph, and maybe a range of regressing pushes that has been narrowed as much as is convenient). Again, mentioned in comment 1 but never followed up on to my knowledge. Importantly, though, revamping the process isn't something we can do by ourselves, so I might benefit from syncing up with someone directly to clarify exactly what expectations we have of each other. The deliverable from that meeting may end up looking like a new playbook that gives the build peers more ownership of and responsibility for diagnosing and resolving regressions.

Flags: needinfo?(rstewart)

bebe: please sync up with Ricky on your build_metrics proposals.

Flags: needinfo?(fstrugariu)

Created a meeting for next week

Flags: needinfo?(fstrugariu)

I think we can close this bug as build metrics load was adjusted and everything is at a normal sheriffing load.

Flags: needinfo?(rstewart)
Flags: needinfo?(dave.hunt)
Flags: needinfo?(aionescu)

Yes, I believe so. Could you summarise the actions taken here :bebe?

Flags: needinfo?(dave.hunt) → needinfo?(fstrugariu)

+1

Flags: needinfo?(rstewart)

Agree

Flags: needinfo?(aionescu)

If I remember correctly, we disabled sheriffing on sccache builds and started to sheriff only "non-sccache builds".

Also, we agreed with Ricky that if we have any issues sheriffing build metrics or identifying the culprits, we can tag his team and open a conversation.

You can find more information on the research in this doc:
https://docs.google.com/document/d/1dyjBrX4PkauulHKHisjVE8GVMzJTegARm2lrQiMZ-8Y/edit#

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(fstrugariu)
Resolution: --- → FIXED