Closed Bug 1653245 Opened 5 years ago Closed 5 years ago

Have build-blame write directly into packfiles instead of creating loose objects, for more efficient disk usage

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kats, Assigned: kats)

I've noticed that if you use the build-blame tool on e.g. the glean repo, it produces a blame repo that's ~99M big (as reported by du -sh). After running git gc on this, the size comes down to 4.2M, which is a pretty massive reduction. The number of files also drops from ~25000 to ~25, so presumably this win comes from packing loose objects into packfiles.
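One possible reading of these numbers, as a rough sketch: `du` counts allocated blocks rather than file contents, so every non-empty loose object costs at least one block no matter how small its compressed data is. The 4 KiB block size below is an assumption (it is not stated in this bug), and the function names are made up for illustration. Some filesystems store tiny files inline, in which case this lower bound would not hold.

```rust
/// Disk usage of one file as `du` would count it: the size rounded up
/// to a whole number of blocks. An empty file is counted as zero blocks.
fn allocated_bytes(file_size: u64, block_size: u64) -> u64 {
    ((file_size + block_size - 1) / block_size) * block_size
}

/// Lower bound on usage for `file_count` non-empty files: one block each.
fn min_usage_bytes(file_count: u64, block_size: u64) -> u64 {
    file_count * allocated_bytes(1, block_size)
}

fn main() {
    // Assumption: 4 KiB filesystem blocks.
    let block_size: u64 = 4096;
    // File counts taken from the comment above: before and after `git gc`.
    for &files in &[25_000u64, 25] {
        let bytes = min_usage_bytes(files, block_size);
        println!(
            "{} files occupy at least {:.1} MiB",
            files,
            bytes as f64 / (1024.0 * 1024.0)
        );
    }
}
```

Under that assumption, ~25000 loose objects would account for most of the ~99M by per-file overhead alone, which fits the guess that the win comes from packing.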

In the case of glean (~3000 commits) it's easy enough to just do a single GC after the blame repo is built, but for m-c (mozilla-central) sized repos, which have hundreds of thousands of commits, the blame repo size balloons and we would want to do periodic GCs as we write revisions into the repo. The libgit2 API doesn't seem to have a "do a GC" method, so we'd have to run git gc via a shell command or something, and doing this will take progressively longer as the repo gets bigger.

I really want to find a better way to do this that doesn't involve running gc so often. Ideally there would be a way to use the libgit2 API to write objects so that they end up in packfiles from the start, rather than going into loose objects. We might have to do some sort of batching up of objects to be written beforehand; I really have no idea.

Summary: Prevent build-blame from generating a giant blame repo that `git gc` shrinks drastically → Have build-blame write directly into packfiles instead of creating loose objects, for more efficient disk usage

I looked into this a bit and couldn't figure out how to make it work. There's a PackBuilder API which can be used to write packfiles, but it takes OIDs, which means you need to have already written the objects to the repo, so you end up with the same data in both loose objects and packs. So that doesn't help.

Digging a bit deeper I found this comment which suggested setting a new odb on the repo with a single pack backend. I tried that too and got a `cannot write object - unsupported in the loaded odb backends` error. So I guess you can't write directly to the pack backend.

It seems like this should be possible for bulk-writing of objects into a git repo but I don't see how to make it work.

/cc glandium in case he has some time to suggest things I can try.

Ah, after some more digging around, it looks like I need to use the mempack backend on the odb, and write to that. Then I can dump that into a packwriter to write it out to disk. Yay!
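The buffer-then-flush shape of that approach can be modelled with a toy sketch that uses only the Rust standard library. This is not the libgit2 or git2-rs API and not the code from the patch: `MemBatch`, `flush_to`, the length-prefixed output format, and the flush threshold are all hypothetical stand-ins. In git2-rs the corresponding pieces appear to be `Odb::add_new_mempack_backend`, `Mempack::dump`, and `Odb::packwriter`, but those names should be checked against the git2-rs documentation for the version in use.

```rust
use std::io::{self, Write};

/// Toy stand-in for an in-memory object backend: it buffers objects
/// instead of writing one file per object.
struct MemBatch {
    objects: Vec<Vec<u8>>,
    buffered_bytes: usize,
}

impl MemBatch {
    fn new() -> Self {
        MemBatch { objects: Vec::new(), buffered_bytes: 0 }
    }

    /// Buffer one object in memory; nothing touches the disk here.
    fn add(&mut self, data: &[u8]) {
        self.buffered_bytes += data.len();
        self.objects.push(data.to_vec());
    }

    /// Write every buffered object to `out` as a single stream, each one
    /// prefixed with its length as 4 little-endian bytes, then clear the
    /// buffer. Returns the number of objects written.
    fn flush_to<W: Write>(&mut self, out: &mut W) -> io::Result<usize> {
        for obj in &self.objects {
            out.write_all(&(obj.len() as u32).to_le_bytes())?;
            out.write_all(obj)?;
        }
        out.flush()?;
        let written = self.objects.len();
        self.objects.clear();
        self.buffered_bytes = 0;
        Ok(written)
    }
}

fn main() -> io::Result<()> {
    let mut batch = MemBatch::new();
    // Each Vec<u8> here stands in for one pack written to disk.
    let mut packs: Vec<Vec<u8>> = Vec::new();
    // Hypothetical threshold; a real tool would pick something far larger.
    let flush_threshold = 64;

    for rev in 0..10 {
        batch.add(format!("blame data for revision {}", rev).as_bytes());
        if batch.buffered_bytes >= flush_threshold {
            let mut pack = Vec::new();
            batch.flush_to(&mut pack)?;
            packs.push(pack);
        }
    }
    // Flush whatever is left after the last revision.
    if !batch.objects.is_empty() {
        let mut pack = Vec::new();
        batch.flush_to(&mut pack)?;
        packs.push(pack);
    }
    println!("wrote {} pack(s) for 10 objects", packs.len());
    Ok(())
}
```

The point of the model is that the number of files on disk scales with the number of flushes rather than with the number of objects, and no object is ever stored twice.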

Assignee: nobody → kats

I wrote the patch to do this, and it works fine and saves a bunch of disk space. Unfortunately speed-wise it's not much better since now the main thread has to spend time doing the delta compression to write out the packfile. It's supposed to spawn threads to do it in parallel but I'm not seeing those show up in my perf profile as rendered by the Firefox profiler. Maybe one of those tools doesn't handle short-lived threads in the way I'm expecting. Or maybe it's not actually spawning the threads, in which case I'll need to fix it in libgit2.

Still, disk space win, so I might as well land it.

https://github.com/mozsearch/mozsearch/pull/356

I guess I'll file a follow-up for more speed optimizations, if I can figure out what to do to make it faster.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Ah, turns out the code to spawn threads when writing to the packfile is recent in libgit2, and isn't in the version that git2-rs builds. So this will fix itself in the future when we upgrade to a version of git2-rs that uses a version of libgit2 that has https://github.com/libgit2/libgit2/pull/5531
