Closed Bug 1654946 Opened 4 years ago Closed 3 years ago

Speed up the build-blame tool more

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kats, Assigned: kats)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

So even after bug 1634770 and bug 1653245, the build-blame tool still runs too slowly for what I have in mind. I recently did a full build of the kaios tree blame, and it took a total of ~50 hours on an indexer instance, which is pretty ridiculous. I expected it to take a lot less, but it turns out the repo gets much larger with the later commits, so processing slows down a lot. It took ~2.5 hours to do the first 100k commits but ~18.5 hours to go from commit 400k to 500k.

I got some profiles while it was running, so I have a reasonably good idea of where time is being spent. Writing packfiles is still slow, mostly during this loop, which reads objects out of the mempack and writes them into a packbuilder. That part runs on a single thread and takes a lot of time; the delta compression that follows is parallelized and relatively quick. Also, after build-blame finished processing all 542008 commits, the repo was 65 GB on disk, and running git gc brought it down to 1.9 GB. So while bug 1653245 helped, there's still a lot of room for improvement.
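
For reference, here's a minimal sketch of the flush path in question, assuming the git2 crate bindings; whether this matches the tool's exact code is an assumption, but Mempack::dump is where the single-threaded loop lives:

    use std::io::Write;
    use git2::{Buf, Mempack, Odb, Repository};

    // Hedged sketch (assuming the git2 crate): flush the in-memory object
    // backend out to a packfile. Mempack::dump runs the single-threaded
    // loop mentioned above -- it reads every object out of the mempack and
    // inserts it into a packbuilder; only the delta compression that
    // follows is parallel.
    fn flush_mempack(repo: &Repository, odb: &Odb, mempack: &Mempack) -> Result<(), git2::Error> {
        let mut buf = Buf::new();
        mempack.dump(repo, &mut buf)?;
        let mut writer = odb.packwriter()?;
        writer
            .write_all(&buf)
            .map_err(|e| git2::Error::from_str(&e.to_string()))?;
        writer.commit()?;
        // Drop the objects we just persisted so memory usage stays bounded.
        mempack.reset()?;
        Ok(())
    }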

During the early part of the build-blame run the main thread doing the I/O was the bottleneck, but over time the bottleneck shifted to the compute side, and the main thread was often blocked on the compute threads. So some tweaking is needed there too. I suspect the round-robin approach to compute scheduling that I implemented breaks caching/locality and makes things slower than they could be.

Do you have an idea how much of the time is spent in the copy/rename detection? Given that this part of the process seems like something we're unlikely to change frequently, perhaps we could store a pre-chewed version of this information so that, instead of calling Repository::diff_tree_to_tree followed by Diff::find_similar, the end product could be derived more efficiently from multiple calls to Repository::diff_blobs which are either Diff::merged together or just processed in isolation. I guess the cheat sheet format would be something like [pre blob path, pre blob hash, post blob path, post blob hash].
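
To make that concrete, one cheat-sheet entry might look something like this (purely illustrative; none of these names exist in the code):

    // Hypothetical shape for one cheat-sheet entry, matching the
    // [pre blob path, pre blob hash, post blob path, post blob hash]
    // format suggested above.
    struct RenameEntry {
        pre_path: std::path::PathBuf,
        pre_blob: git2::Oid,
        post_path: std::path::PathBuf,
        post_blob: git2::Oid,
    }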

For the compute threads, the bulk of the time (60%+) is spent in the diff_tree_to_tree call. A very small amount of time (< 1%) is spent in find_similar. Caching stuff so we don't have to do the diff_tree_to_tree call does sound like a reasonable idea.
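
For context, the hot path being measured is the standard git2 tree-diff plus rename-detection sequence, roughly like this (a sketch, not the tool's exact code):

    use git2::{Diff, DiffFindOptions, Oid, Repository};

    // Sketch of the measured hot path, assuming the git2 crate: the
    // diff_tree_to_tree call is where 60%+ of compute time goes; the
    // find_similar call (copy/rename detection) is under 1%.
    fn diff_commits<'r>(repo: &'r Repository, old: Oid, new: Oid) -> Result<Diff<'r>, git2::Error> {
        let old_tree = repo.find_commit(old)?.tree()?;
        let new_tree = repo.find_commit(new)?.tree()?;
        let mut diff = repo.diff_tree_to_tree(Some(&old_tree), Some(&new_tree), None)?;
        let mut opts = DiffFindOptions::new();
        opts.renames(true).copies(true);
        diff.find_similar(Some(&mut opts))?;
        Ok(diff)
    }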

Last night I also remembered git-fast-import, which might actually be much better suited to our needs than using libgit2 for writing to the blame repo. From the description it sounds like it should do a better job of streaming data directly into packfiles. I also believe it gives us better opportunities for parallelizing the writing, since we wouldn't have to do it all from a single thread but could write to the fast-import stream (with appropriate locking) from different threads. I'll explore that first, since I still feel like this mempack business is the "low hanging fruit" part of the optimization.
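
As a rough illustration of what that would look like (a sketch only; the ref name, identity, path, and contents are all made up), the tool would spawn git fast-import and write commands to its stdin:

    use std::io::Write;
    use std::process::{Command, Stdio};

    // Hedged sketch: stream one commit into `git fast-import`. Run from
    // inside the target repo; everything else here is illustrative. A merge
    // commit would additionally need `from`/`merge` lines naming each parent.
    fn stream_one_commit() -> std::io::Result<()> {
        let mut child = Command::new("git")
            .args(["fast-import", "--quiet"])
            .stdin(Stdio::piped())
            .spawn()?;
        {
            let stdin = child.stdin.as_mut().unwrap();
            let msg = b"generated blame data\n";
            let content = b"blame line annotations go here\n";
            writeln!(stdin, "commit refs/heads/master")?;
            writeln!(stdin, "committer Build Blame <blame@example.com> 1595000000 +0000")?;
            writeln!(stdin, "data {}", msg.len())?;
            stdin.write_all(msg)?;
            writeln!(stdin, "M 100644 inline path/to/file")?;
            writeln!(stdin, "data {}", content.len())?;
            stdin.write_all(content)?;
            writeln!(stdin)?;
        }
        drop(child.stdin.take()); // close stdin so fast-import finishes
        child.wait()?;
        Ok(())
    }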

I've been thinking about the git-fast-import approach on and off for a bit. It seems a little tricky because of the way fast-import takes its input. For merge commits in particular we'd need to keep a lot of state in memory or on disk to feed to fast-import, and it's not immediately obvious to me whether there's a more efficient way to do it.

I got around to implementing the git-fast-import mechanism. WIP is currently at https://github.com/staktrace/mozsearch/commits/fastimport

I got it to the point where it compiles and runs, and then I ran it on a tiny test repo, the version-control-tools repo, and the first 5000 commits of gecko-dev, fixing bugs as they came up. It now produces identical results to the old version on all of those. It's slightly slower on these small sets of commits, but I expect it to handily beat the old version on the full gecko-dev. There's still some pathname sanitization that I think I should deal with, and I'll need to test it on more repos to make sure it's producing exactly the same results as the old code.

Assignee: nobody → kats

I finished running the new code on the full gecko-dev repo. Some notes for posterity:

  • It produced incorrect results, specifically on gecko-dev revision 8660fa72ffd78f0bbf2b9750e33551ceab77afb7, so I'll need to debug and fix that. There might be other errors after that one, but that was the first divergence between the old and new blame repos.
  • Finishing the master branch took about 29h05m on my machine. This is faster than the old code, but not by as much as I was expecting. Again it gets slower around the 500k commit mark in gecko-dev; I think the gecko-dev tree just gets really big around there and slows everything down. That said, there is probably room to optimize further: with the new code the main thread writing to the blame repo is more of a bottleneck than it was with the old code, and there's likely some low-hanging fruit there. I did grab some perf profiles while it was running, but they showed a lot of time in __libc_write without symbols for the surrounding call-stack entries, so I might need to fiddle with things to get better profiles.
  • The blame repo it generated was 8.1G (for just the master branch of gecko-dev). This is way better than the ~75G produced by the old code, but it still benefits from a git repack -adf, which brings it down to 2.5G. This is explained in the packfile optimization section of the git-fast-import documentation: basically, we're providing blobs commit-by-commit, so the delta compression comes out suboptimal. I don't think we can reasonably provide blobs in a better order, so it seems better to just do the repack afterwards. And if we have to repack anyway, it might make sense to issue checkpoint commands periodically, which I didn't do in the version I just finished running (see the sketch after this list).
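
A minimal sketch of that periodic checkpoint idea; the 100k-commit interval is an arbitrary illustration. A checkpoint makes fast-import close its current packfile and start a new one:

    use std::io::Write;

    // Emit a `checkpoint` command to the fast-import stream every N
    // commits so each packfile stays bounded in size.
    fn maybe_checkpoint<W: Write>(stream: &mut W, commits_written: u64) -> std::io::Result<()> {
        if commits_written > 0 && commits_written % 100_000 == 0 {
            writeln!(stream, "checkpoint")?;
        }
        Ok(())
    }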

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #5)

  • It produced incorrect results, specifically on gecko-dev revision 8660fa72ffd78f0bbf2b9750e33551ceab77afb7, so I'll need to debug and fix that. There might be other errors after that one, but that was the first divergence between the old and new blame repos.

Interestingly, this appears to be a second root commit in the gecko-dev repository (i.e. it has no parent). So that explains why the new code barfed on it. Should be easy to fix.

https://github.com/mozsearch/mozsearch/pull/393 has the changes to use fast-import. I verified on a bunch of repos that it produces the same results as the old code.

PR merged, closing bug.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Also, for future reference, the most recent profile I got of the build-blame tool showed that the main thread (currently the bottleneck) spends most of its time in read and write syscalls. If we want to optimize further, I think the next thing to look at would be having the code cache path oids from parent commits, so that fewer ls commands need to be sent to fast-import. The tricky bit is figuring out how to do this so that the cache invalidates properly and doesn't grow unboundedly.
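
A hypothetical shape for that cache (the names and the eviction policy are assumptions, not anything in the code):

    use std::collections::HashMap;

    // Hypothetical path-oid cache as floated above: remember the oid that
    // fast-import's `ls` command reported for a path in the parent commit,
    // and invalidate whenever a commit touches that path. The size bound is
    // a crude placeholder for whatever real eviction policy is needed.
    struct PathOidCache {
        entries: HashMap<String, git2::Oid>,
        capacity: usize,
    }

    impl PathOidCache {
        fn new(capacity: usize) -> Self {
            PathOidCache { entries: HashMap::new(), capacity }
        }

        fn get(&self, path: &str) -> Option<git2::Oid> {
            self.entries.get(path).copied()
        }

        // Call this for every path a new commit modifies.
        fn invalidate(&mut self, path: &str) {
            self.entries.remove(path);
        }

        fn insert(&mut self, path: String, oid: git2::Oid) {
            if self.entries.len() >= self.capacity {
                // Placeholder bound: dropping everything is simple, but an
                // LRU would obviously be better.
                self.entries.clear();
            }
            self.entries.insert(path, oid);
        }
    }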

I'm attaching screenshots of perf profiles I got using the Firefox profiler instructions.

(Screenshots are because the Firefox profiler doesn't allow sharing of perf profiles, only Gecko profiles).

Attachment #9194961 - Attachment description: Screenshot showing write syscalls taking up a lot of time → Screenshot showing read syscalls taking up a lot of time
Attachment #9194960 - Attachment description: Screenshot showing read syscalls taking up a lot of time → Screenshot showing write syscalls taking up a lot of time
