Thanks to the power of multiple monitors, I watched a mozilla-releases.json indexer run under htop, plus another config.json run for comparison, while I did other stuff. There were no surprise processes chewing up the machine.
What I did see was that each of the 8 output-file binaries could grow to as much as 3G RES (with 1G SHR), which put system-wide memory usage (not buffers, not cache) in the 12-15G range on a system with only ~15G available after kernel overhead. This appeared to obliterate the disk cache, and when the output-file binaries completed and were replaced by fresh ones by parallel, they would all block on IO at a high rate for several seconds, sitting in the "D" state with CPU utilization well below 100%. In general, the output-file jobs appeared to stay largely phase-aligned, presumably because each I/O spike consists of all the fresh instances reading exactly the same data; once that data is successfully cached, the stragglers get a multi-second catch-up window that re-aligns them.
To test how much memory mattered, I terminated the mozilla-releases.json run and re-triggered it on an m5d.2xlarge, which is basically a c5d.2xlarge with 2x the memory (16G becomes 32G). Local instance storage also bumps up from 200G to 300G, but since our usage of instance storage maxes out at ~86G, I'm not sure the SSD performance characteristics change from the extra space. That run is time 4 below, with the memory available for each run also called out in the table header.
|repo (in config order)||time 1 (16G)||time 4 (32G)||time 2 (16G)||time 3 (16G)|
I think one conclusion here is that memory scarcity does indeed seem to impact indexing times, and it seems likely to be responsible for the apparent slowdown observed over time on the indexer. The other conclusions are that output-file does indeed need some optimization and that there's still something weird going on, because mozilla-central doesn't have the problem. And if we don't do that optimization soon or move from c5d to m5d instances, there's a real chance of OOM process terminations breaking the indexer. It's not immediately obvious whether mozilla-esr60 lives such a fast and furious life because its jumps file is somehow only 16M while everyone else's jumps file is ~160M, or whether it just benefits from having 20 months less blame data to deal with.
And blame does appear likely to be a large factor here. The retained memory carried across the different files that output-file iterates over, which appears responsible for the growth over time, is:
let mut diff_cache = git_ops::TreeDiffCache::new();
Presumably, as we process the 600-1800 files passed to output-file (counted via wc /tmp/*.par; I'm unsure whether that might also include previous parallel runs), we end up observing more and more revisions/whatever.
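To make that concrete, here's a toy sketch of the pattern I suspect and one crude way to bound it. This is not the actual TreeDiffCache code (which I haven't dug into yet); the key/value types and the FIFO eviction are made up purely for illustration:

    use std::collections::{HashMap, VecDeque};

    /// Illustrative stand-in for a per-run diff cache; the real
    /// git_ops::TreeDiffCache presumably keys on something revision-ish and
    /// stores something much bigger than a Vec<u8>.
    struct BoundedDiffCache {
        diffs: HashMap<String, Vec<u8>>,
        order: VecDeque<String>,
        capacity: usize,
    }

    impl BoundedDiffCache {
        fn new(capacity: usize) -> Self {
            BoundedDiffCache {
                diffs: HashMap::new(),
                order: VecDeque::new(),
                capacity,
            }
        }

        /// Insert a diff, evicting the oldest entry once we hit the cap, so
        /// memory stays proportional to `capacity` rather than to the total
        /// number of revisions observed across all 600-1800 files.
        fn insert(&mut self, rev: String, diff: Vec<u8>) {
            if !self.diffs.contains_key(&rev) {
                if self.order.len() >= self.capacity {
                    if let Some(oldest) = self.order.pop_front() {
                        self.diffs.remove(&oldest);
                    }
                }
                self.order.push_back(rev.clone());
            }
            self.diffs.insert(rev, diff);
        }

        fn get(&self, rev: &str) -> Option<&Vec<u8>> {
            self.diffs.get(rev)
        }
    }

    fn main() {
        let mut cache = BoundedDiffCache::new(2);
        cache.insert("rev-a".into(), vec![1]);
        cache.insert("rev-b".into(), vec![2]);
        cache.insert("rev-c".into(), vec![3]); // evicts rev-a
        assert!(cache.get("rev-a").is_none());
        assert!(cache.get("rev-c").is_some());
    }

A real fix might look quite different (an LRU, clearing the cache between files, or figuring out why mozilla-central doesn't hit this), but even a crude bound would confirm whether diff_cache is really the memory being retained.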
For next steps I'll try some combination of using perf to see if there are obvious problem spots in output-file, and perhaps also instrument it with https://github.com/tokio-rs/tracing and https://github.com/jonhoo/tracing-timing so we can get some visibility and cool histograms.
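As a strawman for what that instrumentation might look like, here's a minimal tracing sketch. The span/event names and the per-file loop are invented, and the plain fmt subscriber shown here would get swapped for a tracing-timing subscriber to actually produce the latency histograms:

    use tracing::{info, info_span};

    fn main() {
        // Plain stdout subscriber for the sketch; a tracing-timing subscriber
        // would slot in here instead so inter-event latencies get aggregated
        // into histograms rather than just logged.
        tracing_subscriber::fmt::init();

        // Hypothetical stand-in for output-file's per-file loop over the
        // 600-1800 paths handed to it by parallel.
        let files = ["some/file.cpp", "another/file.rs"];
        for path in files {
            // One span per file so time can be attributed per file.
            let span = info_span!("output_file", file = path);
            let _guard = span.enter();

            info!("blame_start");
            // ... walk blame / hit the TreeDiffCache here ...
            info!("blame_done");

            info!("render_start");
            // ... emit the static HTML here ...
            info!("render_done");
        }
    }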