Investigate addition of infrastructure for collecting data on the performance of individual build actions
Categories: Firefox Build System :: General (task)
Tracking: firefox82 fixed
People: Reporter: rstewart; Assigned: rstewart
Attachments: 1 file
Comment 1 • 4 years ago
This, hopefully, begins to address an ongoing global problem where we have few, if any, insights into the performance of individual build tasks (compilations, calls into Python scripts, etc.). At most we have aggregated statistics about how long tiers last, combined with `sccache` aggregates across the entire build (which don't cover non-compilation tasks). This has a few implications:
- It's impossible to identify bottlenecks, except by going out of your way to notice and reproduce them. E.g., no one, to my knowledge, was aware that `make_dafsa.py` was a bottleneck until someone happened to notice and report it in bug 1629337. We could have systems that automatically detect this sort of thing, or at least that make it easier to do so than by CTRL-C'ing in the middle of the build several times to try to reproduce the problem.
- It's impossible to detect regressions, unless the regression is so pronounced and severe that it has an immediate impact on the overall build time and triggers build-time alerts.
- It's impossible to confirm that you have fixed a regression, except by doing ad-hoc timing measurements of individual `make` targets. This is error-prone and annoying.
Here we propose a low-friction system wherein individual build tasks log their own perf info. For now, that's a write to `stdout` consisting of the string `BUILDTASK` followed by a simple JSON object with a start time, an end time, the `argv` of the task, and an additional `"context"` key. (I anticipate the `"context"` key could be used to annotate the task with relevant per-task data for later aggregation, for example: was this an `sccache` cache hit or not? For now, it's empty everywhere.) The build controller then collects this data, validates it, and writes out the entire list of build tasks as a JSON file after the build has completed, similarly to what we already do with `build_resources.json`. We already parse some `make` output to do things like tracking when we switch tiers, so this isn't a huge architectural shift or anything.
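The task-side half of this protocol can be sketched in a few lines of Python. This is an illustrative sketch only, not the landed implementation: the comment above specifies the `BUILDTASK` prefix and a JSON object with a start time, an end time, the `argv`, and an empty `"context"` key, but the exact key names and helper shown here (`run_logged`) are assumptions.

```python
import json
import sys
import time


def run_logged(func, argv):
    """Run a build task and emit a BUILDTASK line on stdout.

    Hypothetical sketch: the real key names may differ; the bug only
    specifies start time, end time, argv, and a "context" key that is
    empty for now.
    """
    start = time.monotonic()
    result = func()
    end = time.monotonic()
    record = {
        "start": start,
        "end": end,
        "argv": argv,
        # Reserved for per-task metadata, e.g. sccache hit/miss.
        "context": {},
    }
    # The build controller scans stdout for lines prefixed with the
    # literal string BUILDTASK and parses the remainder as JSON.
    sys.stdout.write("BUILDTASK " + json.dumps(record) + "\n")
    return result
```

A wrapper like this would let any Python-based build task opt in with one call, without changing how the task itself works.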
In my opinion this "should" happen at the build system, or `make`, level, but `make` doesn't expose anything resembling this information to my knowledge, so this has to be implemented outside of `make`. One could implement something like this at the `sccache` level, but that doesn't touch anything but C/C++/Rust compilation tasks; an ideal solution would support other generic build tasks. We could also fork `make` to add this feature ourselves, but for several reasons I don't think that's tractable. :)
Of course, this approach has downsides:
- We depend on parsing the `stdout` of `make`, and processes can unfortunately sometimes trample on each other, occasionally leading to data loss for individual build tasks. To my knowledge this is an inherent limitation of the model, and I don't know that it can be fixed generally. In my testing, not much data tends to be lost.
- Dumping arbitrary data to `stdout` isn't always possible or desirable. If you're not careful about it, this can also result in noisier-than-necessary tasks, especially when those tasks are not invoked by a parent process that knows how to handle the special `BUILDTASK` lines.
- This data is raw enough that aggregation is not completely trivial.
- This functionality has to be added for each new kind of build task whose performance we'd like to track; it doesn't come "for free", since it can't be implemented at the build system level.
- The data isn't terribly small due to the `argv`s (at this point, not nearly big enough to be concerned about IMO, but maybe that will change in the future?)
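The first downside above (trampled `stdout` lines) suggests what the controller side might look like: filter for the `BUILDTASK` prefix, validate each record, and drop anything corrupted by interleaved output. This is a hedged sketch under assumed key names, not the actual mozbuild code.

```python
import json


def collect_buildtasks(lines):
    """Extract valid BUILDTASK records from build output lines.

    Illustrative only: lines trampled by concurrently-writing
    processes will fail JSON parsing or validation and are silently
    dropped, matching the "occasional data loss" caveat above.
    """
    tasks = []
    prefix = "BUILDTASK "
    for line in lines:
        if not line.startswith(prefix):
            continue  # ordinary make output; ignore
        try:
            record = json.loads(line[len(prefix):])
        except json.JSONDecodeError:
            continue  # interleaved/corrupted line; drop it
        # Validate before accepting (assumed key names).
        if {"start", "end", "argv", "context"} <= record.keys():
            tasks.append(record)
    return tasks
```

After the build completes, the accumulated list could be dumped to a JSON file alongside `build_resources.json`, as the comment describes.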
One can imagine a couple of other architectures that could avoid the first two problems, namely: 1) we could use a "real" database, like `sqlite3`, that would not dump info to `stdout` and wouldn't lose data; or 2) we could set up another server, similar to `sccache`, that collects this data from subprocesses and aggregates it, making sure not to lose any along the way. Both of these have enough overhead, in terms of engineering effort or actual impact on latency, that I don't know they make any sense to even attempt implementing. The remaining issues, however, continue to be real.
After this is landed there are a few ways forward. We can start uploading these files as build artifacts in CI to allow us to reason about performance impacts of changes in central. We can easily add this functionality to the `sccache` client to start tracking those builds as well. We already have a very simple visualization of build tier timing in `mach resource-usage`; we could join that data against the `BUILDTASK` data to produce a very clear visualization of build bottlenecks, i.e., "why is the `export` tier taking so long?", etc.
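As a toy example of the kind of aggregation the collected file enables, ranking tasks by duration immediately surfaces bottlenecks like the `make_dafsa.py` case mentioned earlier. The record layout (`start`/`end`/`argv`) follows the description above but is an assumption; joining against tier timing data would be a separate step.

```python
def slowest_tasks(tasks, n=5):
    """Return the n longest-running BUILDTASK records.

    Illustrative aggregation over records shaped like
    {"start": ..., "end": ..., "argv": [...], "context": {}}.
    """
    return sorted(tasks, key=lambda t: t["end"] - t["start"], reverse=True)[:n]
```

The same duration key could feed a per-tier breakdown or a flame-chart-style visualization in `mach resource-usage`.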
Pushed by rstewart@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6b22f23037b7 Add basic infrastructure for collecting data on the performance of individual build actions r=mhentges,froydnj
Comment 3 • 4 years ago
bugherder