Closed Bug 1683563 Opened 3 years ago Closed 3 years ago

Rebuild gecko-blame

Categories: Webtools :: Searchfox (task)
Status: RESOLVED FIXED
People: Reporter: kats; Assigned: kats
There are a bunch of things that would be fixed by rebuilding the gecko-blame repo; see the dependent bugs.

With a tweak to enable breaking of file rewrites (so a heavily rewritten file is treated as a delete plus an add and can be re-matched), we can also fix bug 1559145.

Blocks: 1559145

If we reduce the copy threshold to 40% we would fix bug 1375206, and if we reduce it to 31% we'd also fix bug 1659402. That might be a lot slower though, and may pick up false-positive "copies".

Blocks: 1375206, 1659402
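
For anyone who wants to eyeball what lower thresholds would pick up before committing to a multi-day rebuild, the plain git CLI exposes roughly equivalent knobs. This is only an illustrative sketch; build-blame goes through libgit2 rather than the git binary, and <commit> is a placeholder.

# Show the renames/copies a 40% similarity threshold would detect for one commit;
# -B additionally breaks up heavily-rewritten files, like break_rewrites does.
git diff --name-status -B --find-renames=40% --find-copies=40% <commit>^ <commit>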

Skimming the source, it seems like the process is largely brute force, with only a few fast paths (e.g. similarity_measure bailing out early on files of very different sizes). Which is to say, it's not clear that changing the metric will impact the runtime.

Unfortunately, I didn't see an easy way for us to extract the computed similarity metric, or to re-derive it after the fact.

In any event, I'm definitely on board with seeing what happens when we're more generous with the similarity threshold.

I wonder if this might be a good opportunity to attempt to automate this process. Specifically, something like:

  • Have a command-line way for trigger_indexer.py (or a sibling script) to initiate a blame-rebuild indexing job, which would mainly mean:
    • A slightly different set of shell scripts is used, and any indexer-runtime failsafe expectations are adjusted.
    • The run would always end with the indexer stopped (after some grace period) for now.
    • The script would make sure to use a different S3 artifact name (maybe derived from a branch name?) for the blame in the success case, which would be manually pivoted into place for now (rough sketch below).
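
To make that last point concrete, something like this is what I have in mind (purely a sketch; the bucket name and variable names are invented, and none of this exists yet):

# Hypothetical: derive the S3 artifact name from the branch so a blame-rebuild
# run never clobbers the production gecko-blame.tar; <bucket> is a placeholder.
BRANCH=$(git rev-parse --abbrev-ref HEAD)
aws s3 cp gecko-blame.tar "s3://<bucket>/gecko-blame-${BRANCH}.tar"
# A human would then manually pivot the tarball into place once it's verified.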

A second follow-on phase might then be:

tail -f ~ubuntu/index-log \
  | grep --line-buffered STATUS: \
  | xargs -d '\n' -n1 matrix-commander
  • So the indexers could spam us with both the high-level indexing status and progress updates from the blame processing. In particular, the blame rebuilding might be taught to log status lines for specific tuples of (git or hg hash, filename, line number), which might be part of a correctness-check or behavior-scoring mechanism.

My motivating rationale is that I view the blame rebuilding process as a lot of manual work that currently requires someone with AWS creds to manually initiate it and then more or less babysit it from a shell, all while aware that there's a price tag associated with leaving the server running. I view the bot spam as a mash-up of lower-effort monitoring and an opening for some kind of emergent collaboration. (I do think it would potentially be useful for normal indexers too.)

Note: I'm totally fine if you are planning to rebuild the blame and you want to do it using your normal process! I'm brainstorming out loud here and I can absolutely extract the parts that seem sane to me on my next reading into an enhancement bug, etc.

I agree that automating this process would be good. Redoing the gecko-blame is particularly onerous because we have to do all the different branches too (beta, release, esrs) and it would benefit from being scripted.

That being said, it doesn't require running on an indexer; I have a spare machine at home that I'm doing the blame build on. I decided to (1) do a new build using the current code (i.e. not modifying the diff options), which will fix up the mismatched/missing hg revision problems. After that I'll (2) try to reimplement the blame builder using git-fast-import so it's faster; I read the docs for that some more yesterday and believe it should be possible. I can verify that I get results identical to (1) to ensure I didn't regress anything. And then once that's in place it should be faster to (3) tweak the diff options and experiment with those if needed.
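
For context, the appeal of git-fast-import is that the blame builder can stream synthetic commits into the blame repo in bulk instead of creating each one through libgit2. A minimal sketch of the kind of stream the builder would feed it (paths, contents, and byte counts here are all made up for illustration):

# Run inside the blame repo; each synthetic commit mirrors a gecko-dev commit,
# but the file contents are per-line blame annotations rather than real source.
git fast-import <<'EOF'
commit refs/heads/master
mark :1
author searchfox <searchfox@example.com> 1612137600 +0000
committer searchfox <searchfox@example.com> 1612137600 +0000
data 21
blame for gecko rev 1
M 100644 inline dom/example.cpp
data 10
line blame
EOF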

I started the blame build yesterday evening; it's at 522947/738875 commits now for master (although it gets super-linearly slower over time, so I'd expect it to actually complete later this week). Once that's done I'll do the other branches, which should be faster. I'm also building a script that lays out all the steps as I do it.

Assignee: nobody → kats

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #4)

> That being said, it doesn't require running on an indexer; I have a spare machine at home that I'm doing the blame build on.

An excellent alternative to normal, boring electric heat for the winter! ;)

> I'm also building a script that lays out all the steps as I do it.

Awesome! Thanks!

I finished the blame tarball rebuild. I copied the existing gecko-dev.tar and gecko-blame.tar tarballs into the backups/ folder in the S3 bucket and am uploading the new gecko-blame.tar tarball. I'll kick off indexing runs after that completes, to ensure that everything is working as intended and I didn't screw anything up.
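
For the record (and for the eventual script), the rotation amounts to something like this, with <bucket> standing in for the actual S3 bucket name:

# Back up the current tarballs, then upload the rebuilt blame tarball.
aws s3 cp s3://<bucket>/gecko-dev.tar   s3://<bucket>/backups/gecko-dev.tar
aws s3 cp s3://<bucket>/gecko-blame.tar s3://<bucket>/backups/gecko-blame.tar
aws s3 cp ./gecko-blame.tar s3://<bucket>/gecko-blame.tar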

release1 indexer is deployed, seems good. release2 and release3 have been kicked off and are running.

I'm still working on the git-fast-import version of the blame builder.

release2 and release3 also completed successfully. I also now have the meat of the script that rebuilds gecko-blame, and I'm trying to figure out what form I should land it in. Right now it's intended to run locally, but it should be easy to make a trigger_blame_rebuild script similar to trigger_indexer that will start an EC2 instance, run the rebuild, upload the final product, and shut itself down. That seems like a good end state.

Double-awesome (the rebuild, and the script progress)! rs=asuth on landing what you have now and then that can be evolved over time.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #9)

> rs=asuth on landing what you have now and then that can be evolved over time.

Thanks, I landed it in https://github.com/mozsearch/mozsearch-mozilla/pull/124

What I was imagining was a trigger_blame_rebuild.py script alongside the trigger_indexer.py that can be triggered manually. It would also run main.sh, but then instead of having main.sh run index.sh, main.sh would instead run a different blame-rebuilding script. That script in turn would invoke the repo-specific blame rebuilding scripts and upload the results back to S3. Doing that cleanly involves some refactoring of main.sh and surrounding scripts and that's where I got stalled for now.
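
Roughly, the shape I'm imagining for the main.sh side is a dispatch like this (purely a sketch; the variable and script names are invented and not what's checked in):

# Hypothetical: the trigger script exports JOB_TYPE, and main.sh dispatches on it.
if [ "${JOB_TYPE:-index}" = "blame-rebuild" ]; then
    ./rebuild-blame.sh "$CONFIG_REPO"   # would invoke the repo-specific reblame scripts and upload to S3
else
    ./index.sh "$CONFIG_REPO"           # the normal indexing path
fi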

https://github.com/mozsearch/mozsearch-mozilla/pull/130 is the evolution of the previous script. The intent is still to have a trigger_blame_rebuild or similar script in the mozsearch repo that will call the reblame scripts of the repos to rebuild blame tarballs.

https://github.com/mozsearch/mozsearch-mozilla/pull/136
https://github.com/mozsearch/mozsearch/pull/404

These PRs implement the trigger_blame_rebuild script which works similarly to trigger_indexer. We want to extend it a bit more in the future so that it can pass the --in-place argument to the shared script and do in-place rebuilds, but for now I think it's fine to require manually rotating the new blame tarballs into place.

I started a blame-rebuild using these changes:

diff --git a/tools/src/bin/build-blame.rs b/tools/src/bin/build-blame.rs
index 1eae8df..3fde88b 100644
--- a/tools/src/bin/build-blame.rs
+++ b/tools/src/bin/build-blame.rs
@@ -553,17 +553,24 @@ fn compute_diff_data(
         let mut diff = git_repo
             .diff_tree_to_tree(
                 Some(&commit.parent(0).unwrap().tree().unwrap()),
                 Some(&commit.tree().unwrap()),
                 None,
             )
             .unwrap();
         diff.find_similar(Some(
-            DiffFindOptions::new().copies(true).rename_limit(1000000),
+            DiffFindOptions::new()
+                .copies(true)
+                .copy_threshold(30)
+                .renames(true)
+                .rename_threshold(30)
+                .rename_limit(1000000)
+                .break_rewrites(true)
+                .break_rewrites_for_renames_only(true),
         ))
         .unwrap();
         for delta in diff.deltas() {
             if !delta.old_file().id().is_zero()
                 && !delta.new_file().id().is_zero()
                 && delta.old_file().path() != delta.new_file().path()
             {
                 movement.insert(

and it errored out with this:

[2021-02-27T16:13:08Z INFO  build_blame] Transforming 0f3d2587cebf04eea7707afe8c457c8787ba4dd5 (hg None) progress 308/746489
[2021-02-27T16:13:08Z INFO  build_blame] Transforming 4c4cb0b654c022b6290d31f2863cbb80c1df4ef8 (hg None) progress 309/746489
[2021-02-27T16:13:08Z INFO  build_blame] Transforming f65b3df79d24022162c844367126131af3d1141f (hg None) progress 310/746489
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { code: -1, klass: 35, message: "unrecoverable internal error: \'delta_is_split(tgt)\'" }', src/bin/build-blame.rs:570:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2021-02-27T16:13:08Z INFO  build_blame] Transforming ba7b80b3b75cd42ef2dfbd90b992f69398572e13 (hg None) progress 311/746489
[2021-02-27T16:13:08Z INFO  build_blame] Transforming c0f7f90f73399f1993e0cdc3b06919403eafdaac (hg None) progress 312/746489

which looks like some sort of internal error in libgit2? Will need some investigation :/

Initial investigation shows that it occurs while diffing commit c29b8f890fcee1ed8cde070b3590071d8f8c2d0e to its parent, 78b4db3cce68a147d4dc4842c82b03eabd8e238b. It doesn't appear to be due to thread unsafety in libgit2, since it happens with a single compute thread. Next step is probably to make a reduced test case that I can use to iterate on debugging libgit2.

Quick update: there's been very little movement on the libgit2 issue. I did investigate and wrote a patch that I think is reasonable, but the libgit2 code in question is rather obtuse and I'm not confident in my fix. The maintainer of the code is also unfamiliar with the code in question and has other time constraints, so my PR has been languishing for a while. I decided to fork the dependency chain for now, so now I have mozsearch using my fork of git2-rs, which is using a fork of libgit2 with the fix. Going to try rebuilding the blame with that and see if it works.

Now I just need to remember how to run all those reblame scripts I wrote...

The good news is that my blame-rebuild of gecko-dev (including all the beta/release/esr branches) completed on the indexer in just under 2 days which is faster than I thought it would be. The bad news is this:

+ git gc --aggressive
error: pack-objects died of signal 9
fatal: failed to run repack

so the run failed and threw away the results. I guess it makes sense that the aggressive gc uses a crapton of memory, since it tries to find the most space-efficient delta compression for the packfile, and when there are a lot of objects that's going to be quite memory intensive. Maybe it's not worth doing that since we now have weekly git gc running anyway.

The rerun without the git gc --aggressive completed successfully. The blame tarball (currently gecko-blame-new.tar in the S3 bucket) is ~10G (compared to ~3G for the regular gecko-blame.tar), so maybe I should keep a plain git gc as a compromise between no gc (giant tarballs) and the aggressive gc (OOM).

Anyway, I also kicked off an indexer on the kats channel using the new blame tarball.

So yay, that all worked. Compare blame on:

https://searchfox.org/mozilla-central/source/browser/base/content/browser.xhtml
vs
https://kats.searchfox.org/mozilla-central/source/browser/base/content/browser.xhtml
(that fixes bug 1559145)

and

https://searchfox.org/mozilla-central/source/layout/generic/nsFrameSelection.cpp
vs
https://kats.searchfox.org/mozilla-central/source/layout/generic/nsFrameSelection.cpp
(that fixes bug 1375206)

and

https://searchfox.org/mozilla-central/source/dom/events/IMEContentObserver.cpp
vs
https://kats.searchfox.org/mozilla-central/source/dom/events/IMEContentObserver.cpp
(that fixes bug 1659402)

Now the only question is if we're comfortable using a forked version of libgit2/git2-rs or if we want to wait until libgit2 takes my patch. Another alternative is to use the new blame tarball (which fixes the above bugs) but going forward revert to the old libgit2/diff options. So that's more like a bandaid where future file renames at 30% similarity wouldn't get followed properly.

(In reply to Kartikaya Gupta (email:kats@mozilla.staktrace.com) from comment #18)

> Now the only question is if we're comfortable using a forked version of libgit2/git2-rs or if we want to wait until libgit2 takes my patch. Another alternative is to use the new blame tarball (which fixes the above bugs) but going forward revert to the old libgit2/diff options. So that's more like a bandaid where future file renames at 30% similarity wouldn't get followed properly.

These are some really fantastic improvements! I can see all the way back to 2002 now in browser.xhtml!

I see no problem using a forked version in this case, but given the recent extra activity on https://github.com/libgit2/libgit2/pull/5839 initiated by you, maybe we won't even need to do that? Fingers crossed!

Indeed, the PR is now merged, yay! I'll get the patches cleaned up and ready soon.

https://github.com/mozsearch/mozsearch/pull/423 is the change to the blame-builder tool. This can land first, and then I'll do a rebuild of all the blame tarballs using the new master code and rotate them into place.

Ok I finally sorted out the git2 issues and merged the PRs, and kicked off the blame rebuild runs for all indexers. As they finish I'll rotate the tarballs into place.

The downgrading of git gc --aggressive to just git gc seems to have been a poor choice in retrospect. The blame tarballs are coming out much bigger than the old ones. I did some more digging, and this comment provides some context from Linus himself. It specifically talks about doing a repack after a fast-import to find better deltas. I'll try that and hope that it doesn't blow up on memory again.

More reading of the man page showed me the existence of the --window-memory flag on git repack, which allows setting an upper bound on the memory use per thread. So that's good. I'm now trying this: git repack -f -a -d --depth=250 --window=250 --window-memory=3g
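
For posterity, here's that command annotated (these are all standard git repack options):

# -f : recompute deltas from scratch instead of reusing existing ones
# -a : repack all objects into a single pack
# -d : delete the now-redundant old packs afterwards
# --depth=250 : allow longer delta chains
# --window=250 : consider more candidate objects when searching for deltas
# --window-memory=3g : cap the delta-search memory used by each thread
git repack -f -a -d --depth=250 --window=250 --window-memory=3g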

The repack worked for all the repos I've done it on so far except whatwg-html. That still OOM'd just as it was about to start writing the new pack, but it worked with window-memory=2g. I think what happens is that with 3g it uses up 8 threads × 3g = 24g of memory while computing the compression deltas, and then when it allocates whatever memory it needs at the start of the writing process, that pushes it over the 32g limit and causes it to OOM. When using 2g there's enough free memory at the end of the compression step that it doesn't OOM when writing. That repo is a bit special because it's mostly one humongous file, so that might be why it behaves so differently from the rest.
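
One simple way to cope with an outlier repo like that is to just retry with a smaller per-thread cap if the first repack gets killed; a sketch (not necessarily how it ends up being handled):

# Fall back to a 2g window-memory cap if the 3g repack gets OOM-killed.
git repack -f -a -d --depth=250 --window=250 --window-memory=3g ||
    git repack -f -a -d --depth=250 --window=250 --window-memory=2g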

Anyway, I have a PR up at https://github.com/mozsearch/mozsearch-mozilla/pull/140 that should handle both cases.

As for the blame rebuild, only kaios is still in progress; all the rest are done. I'll close this bug once that one's done.

kaios is done too, so this is all done now. Huzzah!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED