Closed Bug 1646794 Opened 4 years ago Closed 4 years ago

Investigate addition of infrastructure for collecting data on the performance of individual build actions

Categories

(Firefox Build System :: General, task)

Tracking

(firefox82 fixed)

RESOLVED FIXED
82 Branch
Tracking Status
firefox82 --- fixed

People

(Reporter: rstewart, Assigned: rstewart)

Attachments

(1 file)

No description provided.

This, hopefully, begins to address an ongoing global problem: we have few, if any, insights into the performance of individual build tasks (compilations, calls into Python scripts, etc.). At most we have aggregated statistics about how long tiers last, combined with sccache aggregates across the entire build (which don't cover non-compilation tasks). This has a few implications:

  1. It's impossible to identify bottlenecks, except by going out of your way to notice and reproduce them. For example, no one, to my knowledge, was aware that make_dafsa.py was a bottleneck until someone happened to notice and report it in bug 1629337. We could have systems that automatically detect this sort of thing, or that at least make it easier to do so than CTRL-C'ing in the middle of the build several times to try to reproduce the problem.

  2. It's impossible to detect regressions, unless the regression is so pronounced and severe that it has an immediate impact on the overall build time and triggers build time alerts.

  3. It's impossible to identify that you have fixed regressions, except by doing ad-hoc timing measurements by building individual make targets. This is error-prone and annoying.

Here we propose a low-friction system wherein individual build tasks log their own perf info. For now, that's a write to stdout consisting of the string BUILDTASK followed by a simple JSON object with a start time, end time, the argv of the task, and an additional "context" key (I anticipate this could be used to annotate the task with relevant per-task data for later aggregation, for example: was this an sccache cache hit or not? For now, it's empty everywhere). The build controller then collects this data, validates it, and writes out the entire list of build tasks as a JSON file after the build has completed, similarly to what we already do with build_resources.json. We already parse some make output to do stuff like tracking when we switch tiers, so this isn't a huge architectural shift or anything.
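
As a rough illustration of the format described above, here is a minimal Python sketch of both sides: a task emitting its BUILDTASK line and the controller parsing and validating it. The key names ("argv", "start", "end", "context"), the helper names, and the use of an empty object for the context are assumptions made for this example, not necessarily what the landed patch uses.

```python
import json
import sys
import time

PREFIX = "BUILDTASK "  # marker string named in the proposal


def format_build_task(argv, start, end, context=None):
    """Build one BUILDTASK line. Key names are assumed for illustration."""
    record = {
        "argv": list(argv),
        "start": start,
        "end": end,
        "context": context if context is not None else {},
    }
    return PREFIX + json.dumps(record)


def run_and_log(func, *args, **kwargs):
    """Run a whole build task and print its BUILDTASK line, even on failure."""
    start = time.time()
    try:
        return func(*args, **kwargs)
    finally:
        print(format_build_task(sys.argv, start, time.time()), flush=True)


def parse_build_task(line):
    """Return a validated record, or None for any other or damaged line."""
    if not line.startswith(PREFIX):
        return None
    try:
        record = json.loads(line[len(PREFIX):])
    except ValueError:
        # Interleaved output from parallel processes can corrupt the JSON.
        return None
    if not isinstance(record, dict) or not isinstance(record.get("argv"), list):
        return None
    for key in ("start", "end"):
        value = record.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    if record["end"] < record["start"]:
        return None
    return record


def collect_build_tasks(lines):
    """Controller side: keep only the valid BUILDTASK records from build output."""
    parsed = (parse_build_task(line.rstrip("\n")) for line in lines)
    return [record for record in parsed if record is not None]
```

The controller could then pass the result of collect_build_tasks to json.dump once the build has finished. Lines that fail validation are dropped silently here, which matches the "some data may be lost" limitation discussed below.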

In my opinion this "should" happen at the build system, or make, level, but make doesn't expose anything resembling this information to my knowledge, so this has to be implemented outside of make. One could implement something like this at the sccache level but that doesn't touch anything but C/C++/Rust compilation tasks; an ideal solution would support other generic build tasks. We could also fork make to add this feature ourselves, but for several reasons I don't think that's tractable. :)

Of course, this approach has downsides:

  1. We depend on parsing the stdout of make, and processes can unfortunately sometimes trample on each other, occasionally leading to data loss for individual build tasks. This is a necessary limitation of the model to my knowledge, and I don't know that it can be fixed generally. In my testing, little data is usually lost.

  2. Dumping arbitrary data to stdout isn't always possible or desirable. If you're not careful about it, this can also result in noisier-than-necessary tasks, especially when those tasks are not invoked by a parent process that knows how to handle the special BUILDTASK lines.

  3. This data is raw enough that aggregation is not completely trivial.

  4. This functionality has to be added for any new kind of build task whose performance we'd like to track; it doesn't come "for free", because it can't be implemented at the build system level.

  5. The data isn't awfully small due to the argv's (at this point, not nearly big enough that we need to be concerned about it IMO, but maybe that will change in the future?)
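
To give a sense of what downside 3 means in practice, here is a minimal sketch of one possible aggregation: grouping records by the basename of argv[0] and summing wall-clock time. The record keys ("argv", "start", "end") and the choice of grouping key are assumptions for illustration; a real aggregation would probably need to look further into argv to tell apart different scripts run through the same interpreter.

```python
import os
from collections import defaultdict


def aggregate_by_program(records):
    """Sum task count and duration per program, slowest program first.

    Each record is assumed to be a dict with "argv", "start" and "end" keys.
    """
    totals = defaultdict(lambda: {"count": 0, "seconds": 0.0})
    for record in records:
        argv = record.get("argv") or ["<unknown>"]
        key = os.path.basename(argv[0])
        totals[key]["count"] += 1
        totals[key]["seconds"] += record["end"] - record["start"]
    return sorted(totals.items(), key=lambda item: item[1]["seconds"], reverse=True)
```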

One can imagine a couple of other architectures that could avoid the first two problems, namely: 1) we could use a "real" database, like sqlite3, that would not dump info to stdout and wouldn't lose data; or, 2) we could set up another server, similar to sccache, that collects this data from subprocesses and aggregates it, making sure not to lose any along the way. Both of these have enough overhead, in terms of engineering effort or actual impact on latency, that I don't know that they make any sense to even attempt implementing. The remaining three downsides continue to be real issues, however.

After this lands there are a few ways forward. We can start uploading these files as build artifacts in CI to allow us to reason about performance impacts of changes in central. We can easily add this functionality to the sccache client to start tracking those builds as well. We already have a very simple visualization of build tier timing in mach resource-usage; we could join that data against the BUILDTASK data to produce a very clear visualization of build bottlenecks, i.e., "why is the export tier taking so long", etc.
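
As a sketch of what joining tier timing against the BUILDTASK data might look like, the following assigns each task to the tier whose time interval contains the task's start time, then lists the slowest tasks in a given tier. The tier representation (a list of (name, start, end) tuples) and the record keys are hypothetical; the actual layout of the tier data in build_resources.json may differ.

```python
def tasks_by_tier(tiers, records):
    """Map each tier name to the task records that started during that tier.

    tiers: list of (name, start, end) tuples (hypothetical layout).
    records: dicts with "start" and "end" keys (assumed key names).
    """
    result = {name: [] for name, _, _ in tiers}
    for record in records:
        for name, tier_start, tier_end in tiers:
            if tier_start <= record["start"] < tier_end:
                result[name].append(record)
                break
    return result


def slowest_in_tier(tiers, records, tier_name, limit=5):
    """Return up to `limit` longest-running tasks that started in `tier_name`."""
    tasks = tasks_by_tier(tiers, records).get(tier_name, [])
    return sorted(tasks, key=lambda r: r["end"] - r["start"], reverse=True)[:limit]
```

Calling slowest_in_tier(tiers, records, "export") would then be a first answer to "why is the export tier taking so long".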

Pushed by rstewart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6b22f23037b7
Add basic infrastructure for collecting data on the performance of individual build actions r=mhentges,froydnj
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 82 Branch
Regressions: 1662573