Bug 1517978 Comment 5 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

I spent some time looking at https://github.com/mozilla/microannotate with some help from Marco today (in #codecoverage on chat.mozilla.org).  My understanding is:
- microannotate transforms the source so that there's one non-whitespace token per source line broken up using a regex using `\w+` and explicit treatment of a bunch of programming punctuation characters as their own tokens.
  - Whitespace is ignored and is re-inferred from the original text/token stream at rendering/generation time.
  - The optional rust-code-analysis hookup is only for comment removal and does not inform the determination of what's a token.
- This allows git's normal line-centric blame/annotate logic to operate on a per-token basis (which means searchfox's blame precomputation strategy is still viable).
- Because whitespace is no longer significant, changes to whitespace don't impact blame history at all.

Various thoughts:
- The big question is whether this makes the greedy diff algorithm do stuff that's unacceptably wackier than a line-centric approach.  It opens up more territory for bizarre longest subsequences even as it removes whitespace from the equation.
  - Looking at some random `git log -p`'s from the derived repo at https://github.com/marco-c/gecko-dev-wordified (may be private), it seems like there are some weird reuses of random punctuation tokens, but that's hardly surprising and is something that can potentially be addressed by making the policy decision to not surface / allow following the blame of random punctuation that's almost certain to be nonsensical.
  - Otherwise the diffs make sense and that also makes sense because most patches to mozilla-central are going to be quite small in size, especially with whitespace removed from the equation.
- The availability of https://github.com/mozilla/rust-code-analysis presents some interesting potential to semantically bind the punctuation characters so that they aren't randomly reused and thereby help avoid weird diffs.
  - That is, for the closing `}` of the function `doFoo`, one could emit `} doFoo` or `} HASH` where HASH is a fixed-size hash function over "doFoo" with a trade-off between string length and collision chance.
  - This additional semantic data attached to the tokens could also assist in reconstructing moves in a file as tokens whose identities under normal blame processing would be ambiguous are rendered entirely unambiguous.  If the opening and closing braces of a function were both deleted and reappeared elsewhere in the file, that's a move, eh?
I spent some time looking at https://github.com/mozilla/microannotate with some help from Marco today (in #codecoverage on chat.mozilla.org).  My understanding is:
- microannotate transforms the source so that there's one non-whitespace token per source line broken up using a regex using `\w+` and explicit treatment of a bunch of programming punctuation characters as their own tokens.
  - Whitespace is ignored and is re-inferred from the original text/token stream at rendering/generation time.
  - The optional rust-code-analysis hookup is only for comment removal and does not inform the determination of what's a token.
- This allows git's normal line-centric blame/annotate logic to operate on a per-token basis (which means searchfox's blame precomputation strategy is still viable).
- Because whitespace is no longer significant, changes to whitespace don't impact blame history at all.  Other code formatting changes like introduction/removal of braces/extra parentheses similarly do not meaningfully impact blame.

Various thoughts:
- The big question is whether this makes the greedy diff algorithm do stuff that's unacceptably wackier than a line-centric approach.  It opens up more territory for bizarre longest subsequences even as it removes whitespace from the equation.
  - Looking at some random `git log -p`'s from the derived repo at https://github.com/marco-c/gecko-dev-wordified (may be private), it seems like there are some weird reuses of random punctuation tokens, but that's hardly surprising and is something that can potentially be addressed by making the policy decision to not surface / allow following the blame of random punctuation that's almost certain to be nonsensical.
  - Otherwise the diffs make sense and that also makes sense because most patches to mozilla-central are going to be quite small in size, especially with whitespace removed from the equation.
- The availability of https://github.com/mozilla/rust-code-analysis presents some interesting potential to semantically bind the punctuation characters so that they aren't randomly reused and thereby help avoid weird diffs.
  - That is, for the closing `}` of the function `doFoo`, one could emit `} doFoo` or `} HASH` where HASH is a fixed-size hash function over "doFoo" with a trade-off between string length and collision chance.
  - This additional semantic data attached to the tokens could also assist in reconstructing moves in a file as tokens whose identities under normal blame processing would be ambiguous are rendered entirely unambiguous.  If the opening and closing braces of a function were both deleted and reappeared elsewhere in the file, that's a move, eh?
I spent some time looking at https://github.com/mozilla/microannotate with some help from Marco today (in #codecoverage on chat.mozilla.org).  My understanding is:
- microannotate transforms the source so that there's one non-whitespace token per source line broken up using a regex using `\w+` and explicit treatment of a bunch of programming punctuation characters as their own tokens.
  - Whitespace is ignored and is re-inferred from the original text/token stream at rendering/generation time.
  - The optional rust-code-analysis hookup is only for comment removal and does not inform the determination of what's a token.
- This allows git's normal line-centric blame/annotate logic to operate on a per-token basis (which means searchfox's blame precomputation strategy is still viable).
- Because whitespace is no longer significant, changes to whitespace don't impact blame history at all.  Other code formatting changes like introduction/removal of braces/extra parentheses similarly do not meaningfully impact blame.

Various thoughts:
- The big question is whether this makes the greedy diff algorithm do stuff that's unacceptably wackier than a line-centric approach.  It opens up more territory for bizarre longest subsequences even as it removes whitespace from the equation.
  - Looking at some random `git log -p`'s from the derived repo at https://github.com/marco-c/gecko-dev-wordified (may be private), it seems like there are some weird reuses of random punctuation tokens, but that's hardly surprising and is something that can potentially be addressed by making the policy decision to not surface / allow following the blame of random punctuation that's almost certain to be nonsensical.
  - Otherwise the diffs make sense and that also makes sense because most patches to mozilla-central are going to be quite small in size, especially with whitespace removed from the equation.
- The availability of https://github.com/mozilla/rust-code-analysis presents some interesting potential to semantically bind the punctuation characters so that they aren't randomly reused and thereby help avoid weird diffs.
  - That is, for the closing `}` of the function `doFoo`, one could emit `} doFoo` or `} HASH` where HASH is a fixed-size hash function over "doFoo" with a trade-off between string length and collision chance.
  - This additional semantic data attached to the tokens could also assist in reconstructing moves in a file as tokens whose identities under normal blame processing would be ambiguous are rendered entirely unambiguous.  If the opening and closing braces of a function were both deleted and reappeared elsewhere in the file, that's a move, eh?
  - That said, one might also go directly to an AST-based diff, such as https://github.com/afnanenayet/diffsitter but that would represent a more fundamental architectural change.

Back to Bug 1517978 Comment 5