Bug 1517978 · Open · Opened 4 years ago · Updated 4 months ago

Do better with skip-blame for when lines get collapsed

Categories: Webtools :: Searchfox
Type: enhancement
Priority: Not set
Severity: normal
Tracking: Not tracked
People: Reporter: kats; Assignee: Unassigned

This is bug 1517657 but for searchfox, since it has the same problem. I like Karl's idea of using the most recently modified line of the ones that get collapsed.

I spent some time writing code to do this; it seems to work but is really slow, and it only fixes some of the problems. There are all sorts of other problems because, within a hunk, lines early in the hunk can get split or merged, which throws off the blame for the rest of the lines.

I have an idea on how to do this better: for each revision to ignore, we should compute an accurate mapping from the "after" state to the "before" state, and save that in a tarball on S3 so that we don't have to recompute it on every run. Instead we would just compute the mapping for any new revisions added to the list.
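
A minimal sketch of the caching side (compute_mapping and the shapes here are hypothetical, not existing searchfox code):

    use std::collections::HashMap;

    /// For each line in the "after" state, the "before" line whose blame
    /// it should inherit (None for lines introduced by the ignored
    /// revision).
    type LineMap = Vec<Option<u32>>;

    /// Hypothetical stand-in for the expensive step described above.
    fn compute_mapping(rev: &str) -> LineMap {
        unimplemented!("recursive substring matching; see below")
    }

    /// `cache` mirrors the tarball we would keep on S3. Only revisions
    /// newly added to the ignore list get their mapping computed; the
    /// returned list tells the caller whether to re-upload the tarball.
    fn update_cache(cache: &mut HashMap<String, LineMap>,
                    ignored: &[&str]) -> Vec<String> {
        let mut added = Vec::new();
        for rev in ignored {
            if !cache.contains_key(*rev) {
                cache.insert(rev.to_string(), compute_mapping(rev));
                added.push(rev.to_string());
            }
        }
        added
    }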

In order to compute the mapping accurately, I think we want to take the "before" code and the "after" code from the hunk and recursively do longest-substring matches to compute a character mapping. Then, for a given line in the "after" text, go through each character on the line that has a corresponding "before" character, and pick the most recently modified of the "before" lines containing those characters. This should build an accurate line map, which we would save to disk and put in S3. But it's going to be relatively expensive to compute.

The bcmp crate should help with the recursive substring matching part, since it has an implementation of Ukkonen's algorithm, which is what we want for fast substring matching.
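
A minimal sketch of the recursive matching, assuming bcmp's longest_common_substring/AlgoSpec API (TreeMatch is its suffix-tree implementation; the 8-byte minimum match length is an arbitrary choice to avoid pairing up stray punctuation):

    use bcmp::{longest_common_substring, AlgoSpec};

    /// map[i] == Some(j) means byte i of "after" corresponds to byte j of
    /// "before"; None means the byte was introduced by this revision.
    fn map_chars(before: &[u8], after: &[u8]) -> Vec<Option<usize>> {
        let mut map = vec![None; after.len()];
        map_range(before, 0, after, 0, &mut map);
        map
    }

    fn map_range(before: &[u8], b_off: usize, after: &[u8], a_off: usize,
                 map: &mut [Option<usize>]) {
        if before.is_empty() || after.is_empty() {
            return;
        }
        let m = longest_common_substring(before, after, AlgoSpec::TreeMatch(8));
        if m.length == 0 {
            return;
        }
        // Record the match itself...
        for k in 0..m.length {
            map[a_off + m.second_pos + k] = Some(b_off + m.first_pos + k);
        }
        // ...then recurse into the unmatched text on either side of it.
        map_range(&before[..m.first_pos], b_off,
                  &after[..m.second_pos], a_off, map);
        map_range(&before[m.first_pos + m.length..], b_off + m.first_pos + m.length,
                  &after[m.second_pos + m.length..], a_off + m.second_pos + m.length,
                  map);
    }

From the character map, the line map falls out: for each "after" line, collect the "before" lines its mapped characters land on and keep the most recently modified one.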

Bug 1517657 comment 2 shows another possible solution.

(In reply to Marco Castelluccio [:marco] from comment #2)

> Bug 1517657 comment 2 shows another possible solution.

Holy cow, the https://cregit.linuxsources.org/code/4.19/net/ethernet/eth.c.html example seems amazing. It would definitely be fantastic for searchfox to be able to display that just a portion of a line changed, etc. I suspect this means that the code formatting logic would want a stateful HTML emitter that would track outstanding highlighting ranges and fragment them appropriately across the different lines and div nesting that we do for the position: sticky stuff. (Or the existing ad hoc code could be extended VERY carefully...)
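
To make that concrete, a hand-wavy sketch of such an emitter (the names and the byte-offset range representation are made up; real code would also need HTML escaping):

    /// Highlight ranges are (start, end) byte offsets into the whole
    /// file, sorted and non-overlapping, assumed to fall on character
    /// boundaries.
    struct HighlightEmitter {
        ranges: Vec<(usize, usize)>,
        next: usize, // first range not yet fully emitted
        pos: usize,  // byte offset of the start of the current line
    }

    impl HighlightEmitter {
        fn new(mut ranges: Vec<(usize, usize)>) -> Self {
            ranges.sort();
            HighlightEmitter { ranges, next: 0, pos: 0 }
        }

        /// Emit one line's <div>, fragmenting any highlight range that
        /// crosses the line boundary so every line's markup stays
        /// self-contained (which the per-line div nesting for the
        /// position:sticky machinery requires).
        fn emit_line(&mut self, line: &str) -> String {
            let (start, end) = (self.pos, self.pos + line.len());
            let mut html = String::from("<div class=\"line\">");
            let mut cursor = start;
            while self.next < self.ranges.len() && self.ranges[self.next].0 < end {
                let (rs, re) = self.ranges[self.next];
                let frag_start = rs.max(cursor);
                let frag_end = re.min(end);
                // Plain text before the fragment, then the fragment.
                html.push_str(&line[cursor - start..frag_start - start]);
                html.push_str("<span class=\"hl\">");
                html.push_str(&line[frag_start - start..frag_end - start]);
                html.push_str("</span>");
                cursor = frag_end;
                if re <= end {
                    self.next += 1; // range closed on this line
                } else {
                    break; // range continues; reopened on the next line
                }
            }
            html.push_str(&line[cursor - start..]);
            html.push_str("</div>");
            self.pos = end + 1; // account for the newline
            html
        }
    }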

It looks like https://github.com/mozilla/microannotate is where the tool lives?

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #3)

> It looks like https://github.com/mozilla/microannotate is where the tool lives?

Sorry for not replying earlier... Yes, that's the tool! It supports splitting by word, and it also has an option for removing comments (useful if you're interested in blaming code only, but I guess that's not the case for Searchfox).

I spent some time looking at https://github.com/mozilla/microannotate with some help from Marco today (in #codecoverage on chat.mozilla.org). My understanding is:

  • microannotate transforms the source so that there's one non-whitespace token per source line, splitting on a regex that keeps \w+ runs together and treats a bunch of programming punctuation characters as their own tokens (a rough sketch follows this list).
    • Whitespace is ignored and is re-inferred from the original text/token stream at rendering/generation time.
    • The optional rust-code-analysis hookup is only for comment removal; it does not inform the determination of what's a token.
  • This allows git's normal line-centric blame/annotate logic to operate on a per-token basis (which means searchfox's blame precomputation strategy is still viable).
  • Because whitespace is no longer significant, changes to whitespace don't impact blame history at all. Other code formatting changes like introduction/removal of braces/extra parentheses similarly do not meaningfully impact blame.
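
A rough Rust rendition of that splitting rule (microannotate itself is Python, and the exact regex here is a guess from the description above, not its actual pattern):

    use regex::Regex;

    /// One token per output line: runs of word characters stay together,
    /// every other non-whitespace character becomes its own token, and
    /// whitespace is dropped entirely (to be re-inferred at render time).
    fn wordify(source: &str) -> String {
        let token = Regex::new(r"\w+|[^\w\s]").unwrap();
        token.find_iter(source)
            .map(|m| m.as_str())
            .collect::<Vec<_>>()
            .join("\n")
    }

    // wordify("foo(bar);") => "foo\n(\nbar\n)\n;"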

Various thoughts:

  • The big question is whether this makes the greedy diff algorithm do stuff that's unacceptably wackier than a line-centric approach. It opens up more territory for bizarre longest subsequences even as it removes whitespace from the equation.
    • Looking at some random git log -p's from the derived repo at https://github.com/marco-c/gecko-dev-wordified (may be private), there are some weird reuses of random punctuation tokens, but that's hardly surprising, and it could be addressed by a policy decision not to surface (or allow following) the blame of random punctuation that's almost certain to be nonsensical.
    • Otherwise the diffs make sense and that also makes sense because most patches to mozilla-central are going to be quite small in size, especially with whitespace removed from the equation.
  • The availability of https://github.com/mozilla/rust-code-analysis presents some interesting potential to semantically bind the punctuation characters so that they aren't randomly reused and thereby help avoid weird diffs.
    • That is, for the closing } of the function doFoo, one could emit } doFoo or } HASH, where HASH is a fixed-size hash of "doFoo", trading token length against collision chance (a tiny sketch follows this list).
    • This additional semantic data attached to the tokens could also help reconstruct moves within a file, since tokens whose identities would be ambiguous under normal blame processing become entirely unambiguous. If the opening and closing braces of a function were both deleted and reappeared elsewhere in the file, that's a move, eh?
    • That said, one might also go directly to an AST-based diff, such as https://github.com/afnanenayet/diffsitter but that would represent a more fundamental architectural change.
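
For the } HASH variant, a tiny sketch (DefaultHasher stands in for whatever stable hash a real implementation would want; 16 bits is an arbitrary length/collision trade-off):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Emit a closing-brace token bound to the function it closes, so the
    /// braces of doFoo and doBar diff as distinct tokens rather than as
    /// interchangeable bare "}" tokens.
    fn bind_closing_brace(enclosing_fn: &str) -> String {
        let mut h = DefaultHasher::new();
        enclosing_fn.hash(&mut h);
        // Truncate to 16 bits (4 hex digits): short enough not to bloat
        // the token stream, distinct enough within a single file.
        format!("}} {:04x}", h.finish() & 0xffff)
    }

    // bind_closing_brace("doFoo") might yield "} 9c2e"; the exact digits
    // depend on the hasher and are illustrative only.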

Using rust-code-analysis to split lines in a smarter way was one of the things I wanted to try at the time.

> That said, one might also go directly to an AST-based diff, such as https://github.com/afnanenayet/diffsitter but that would represent a more fundamental architectural change.

This would be super interesting; I wanted to support something like that in rust-code-analysis! Yeah, I think using that would be way more complex than just word-level splitting, where you can more or less use the tools we already have with very few changes.

An interesting practical example of the diff algorithm doing unexpected-but-reasonable things just came up in a use case for :Gijs, where the question of the origin of nsWebBrowserPersist::CalculateAndAppendFileExt came up. microannotate says the origin is https://hg.mozilla.org/mozilla-central/rev/7c5745d2ff1f1c26945f5d68b9c68673e62c5797, and the gecko-dev-wordified commit at https://github.com/marco-c/gecko-dev-wordified/commit/a786fcc4e1dc666300f0ad5a59a3e7e407466e25 seems to show what happened: because a relatively large block of code (nsWebBrowserPersist::EnumPersistURIs) was moved around in that commit, the diff algorithm found it easier to move the nsWebBrowserPersist::CalculateAndAppendFileExt token stream by deleting it and re-adding it.

That is, the diff algorithm had to trade off which blame should survive, and it picked the less obvious option.
