Bug 1517978 Comment 8 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Andrew Sutherland [:asuth] (he/him)

on 2023-07-30 11:48:36 PDT

I'll be hobby-hacking on this during PTO this week so assigning to myself to avoid overlap on this goal.  This does not mean that the current blame implementation is going anywhere!  (And in particular I see from https://github.com/mozsearch/mozsearch/pull/644 that :kats may be looking at some enhancements? :)  My hope is just to get this to a prototype stage I can share[1], but I'm optimistic because I previously got a tree-sitter based tokenizer in place that implements the "semantic"[2] binding I proposed in comment 5 so I'm fairly confident about that feature.  The question for me is how much progress I'll make on the other history related functionality I'm envisioning but which does need to be baked into the history processing pipeline[3].

1: I think this probably crosses into functionality that would be useful for others even if it won't be land-able in its current state, so I've re-created an "asuth" channel that will be available at https://asuth.searchfox.org/ and will mention any seemingly working runs at https://chat.mozilla.org/#/room/#searchfox:mozilla.org and/or on my blog at https://www.visophyte.org/blog/ which is on planet.mozilla.org.  Because there are costs associated to channels like this, I'll turn it back off if no one's using it once I'm not actively hacking on this.

2: "syntax-bound" is probably a better way to put it since I'm using tree-sitter which for something like C++ is definitely not semantic, although the use of the pattern-matching query language (as is used for the "tags.scm" files) does mean that we are operating at a slightly higher level of abstraction than just the pure AST.  In particular, this does open up some possibilities to impose additional layers in the binding stack based on custom heuristics that aren't something that we would magically have if we were to consume a clang AST but instead would also need to create via tree-matching.

3: Note that I'm very explicitly building on top of build-blame.rs.  It just will get used twice; the first time we build the "dual" syntax-based representation of the underlying tree, and then the second stage consumes that syntax git tree instead of the source git tree.  The only real change to history processing for the second stage is that in the `'+'` and `'-'` processing we can leverage the binding stack to be able to infer code re-ordering and moves instead of treating all `'+'` diff lines as new code.  Also, this is a spot where I'd like to derive information to allow `git log -S`-like super-powers where we can do things like create token "tombstones" so that if someone searches for a token that no longer exists we can have the search results provide a link to the history of the token[4].

4: There are some interesting options to be able to keep the amount of data that searchfox has to deal with at any time reasonable and thereby keep searchfox responsive.  In particular, my tentative plan is to take an RRDtool-like approach where the HEAD revision of any summary history files would have detailed per-revision information for "recent" changes, and then apply levels of aggregation/summarization for "medium old" and "very old" changes.  Any consolidated (JSON-ND) record could reference the pre-change-HEAD revision where the pre-aggregated data can be found.  Each aggregated block would also hold the first/last revisions of the data it's a roll-up for.  There would also be a header in the file which could allow indicating for tokens that are ridiculously common that we've categorized it as a stopword and are handling the token differently.  This needn't mean that we don't do anything for the token at all.  There could be interesting fun to be had from having weekly token added/removed stats to be able to show comparative visualizations of use of RefPtr versus UniquePtr.  Stats for the token `the` might not be quite so useful, but as long as the indexer and web-server don't catch on fire, there might not need to be special handling for extreme stop words.

Revision 1 by

Andrew Sutherland [:asuth] (he/him)

on 2023-07-30 11:53:12 PDT

I'll be hobby-hacking on this during PTO this week so assigning to myself to avoid overlap on this goal.  This does not mean that the current blame implementation is going anywhere!  (And in particular I see from https://github.com/mozsearch/mozsearch/pull/644 that :kats may be looking at some enhancements? edit: I now see this was in response to a rustc warning)  My hope is just to get this to a prototype stage I can share[1], but I'm optimistic because I previously got a tree-sitter based tokenizer in place that implements the "semantic"[2] binding I proposed in comment 5 so I'm fairly confident about that feature.  The question for me is how much progress I'll make on the other history related functionality I'm envisioning but which does need to be baked into the history processing pipeline[3].

1: I think this probably crosses into functionality that would be useful for others even if it won't be land-able in its current state, so I've re-created an "asuth" channel that will be available at https://asuth.searchfox.org/ and will mention any seemingly working runs at https://chat.mozilla.org/#/room/#searchfox:mozilla.org and/or on my blog at https://www.visophyte.org/blog/ which is on planet.mozilla.org.  Because there are costs associated to channels like this, I'll turn it back off if no one's using it once I'm not actively hacking on this.

2: "syntax-bound" is probably a better way to put it since I'm using tree-sitter which for something like C++ is definitely not semantic, although the use of the pattern-matching query language (as is used for the "tags.scm" files) does mean that we are operating at a slightly higher level of abstraction than just the pure AST.  In particular, this does open up some possibilities to impose additional layers in the binding stack based on custom heuristics that aren't something that we would magically have if we were to consume a clang AST but instead would also need to create via tree-matching.

3: Note that I'm very explicitly building on top of build-blame.rs.  It just will get used twice; the first time we build the "dual" syntax-based representation of the underlying tree, and then the second stage consumes that syntax git tree instead of the source git tree.  The only real change to history processing for the second stage is that in the `'+'` and `'-'` processing we can leverage the binding stack to be able to infer code re-ordering and moves instead of treating all `'+'` diff lines as new code.  Also, this is a spot where I'd like to derive information to allow `git log -S`-like super-powers where we can do things like create token "tombstones" so that if someone searches for a token that no longer exists we can have the search results provide a link to the history of the token[4].

4: There are some interesting options to be able to keep the amount of data that searchfox has to deal with at any time reasonable and thereby keep searchfox responsive.  In particular, my tentative plan is to take an RRDtool-like approach where the HEAD revision of any summary history files would have detailed per-revision information for "recent" changes, and then apply levels of aggregation/summarization for "medium old" and "very old" changes.  Any consolidated (JSON-ND) record could reference the pre-change-HEAD revision where the pre-aggregated data can be found.  Each aggregated block would also hold the first/last revisions of the data it's a roll-up for.  There would also be a header in the file which could allow indicating for tokens that are ridiculously common that we've categorized it as a stopword and are handling the token differently.  This needn't mean that we don't do anything for the token at all.  There could be interesting fun to be had from having weekly token added/removed stats to be able to show comparative visualizations of use of RefPtr versus UniquePtr.  Stats for the token `the` might not be quite so useful, but as long as the indexer and web-server don't catch on fire, there might not need to be special handling for extreme stop words.

Revision 2 by

Andrew Sutherland [:asuth] (he/him)

on 2023-07-30 11:53:35 PDT

I'll be hobby-hacking on this during PTO this week so assigning to myself to avoid overlap on this goal.  This does not mean that the current blame implementation is going anywhere!  (And in particular I see from https://github.com/mozsearch/mozsearch/pull/644 that :kats may be looking at some enhancements? edit: I now see this was in response to a rustc warning.)  My hope is just to get this to a prototype stage I can share[1], but I'm optimistic because I previously got a tree-sitter based tokenizer in place that implements the "semantic"[2] binding I proposed in comment 5 so I'm fairly confident about that feature.  The question for me is how much progress I'll make on the other history related functionality I'm envisioning but which does need to be baked into the history processing pipeline[3].

1: I think this probably crosses into functionality that would be useful for others even if it won't be land-able in its current state, so I've re-created an "asuth" channel that will be available at https://asuth.searchfox.org/ and will mention any seemingly working runs at https://chat.mozilla.org/#/room/#searchfox:mozilla.org and/or on my blog at https://www.visophyte.org/blog/ which is on planet.mozilla.org.  Because there are costs associated to channels like this, I'll turn it back off if no one's using it once I'm not actively hacking on this.

2: "syntax-bound" is probably a better way to put it since I'm using tree-sitter which for something like C++ is definitely not semantic, although the use of the pattern-matching query language (as is used for the "tags.scm" files) does mean that we are operating at a slightly higher level of abstraction than just the pure AST.  In particular, this does open up some possibilities to impose additional layers in the binding stack based on custom heuristics that aren't something that we would magically have if we were to consume a clang AST but instead would also need to create via tree-matching.

3: Note that I'm very explicitly building on top of build-blame.rs.  It just will get used twice; the first time we build the "dual" syntax-based representation of the underlying tree, and then the second stage consumes that syntax git tree instead of the source git tree.  The only real change to history processing for the second stage is that in the `'+'` and `'-'` processing we can leverage the binding stack to be able to infer code re-ordering and moves instead of treating all `'+'` diff lines as new code.  Also, this is a spot where I'd like to derive information to allow `git log -S`-like super-powers where we can do things like create token "tombstones" so that if someone searches for a token that no longer exists we can have the search results provide a link to the history of the token[4].

4: There are some interesting options to be able to keep the amount of data that searchfox has to deal with at any time reasonable and thereby keep searchfox responsive.  In particular, my tentative plan is to take an RRDtool-like approach where the HEAD revision of any summary history files would have detailed per-revision information for "recent" changes, and then apply levels of aggregation/summarization for "medium old" and "very old" changes.  Any consolidated (JSON-ND) record could reference the pre-change-HEAD revision where the pre-aggregated data can be found.  Each aggregated block would also hold the first/last revisions of the data it's a roll-up for.  There would also be a header in the file which could allow indicating for tokens that are ridiculously common that we've categorized it as a stopword and are handling the token differently.  This needn't mean that we don't do anything for the token at all.  There could be interesting fun to be had from having weekly token added/removed stats to be able to show comparative visualizations of use of RefPtr versus UniquePtr.  Stats for the token `the` might not be quite so useful, but as long as the indexer and web-server don't catch on fire, there might not need to be special handling for extreme stop words.
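
To make the RRDtool-like roll-up in note 4 a bit more concrete, here is a minimal sketch in Python.  It is not taken from mozsearch or build-blame.rs; the `TokenStats` record, its field names, and the `keep_recent`/`bucket_size` knobs are all hypothetical, and a real implementation would likely have more than one aggregation level.  It only shows the core idea: keep the newest per-revision records as-is, fold older ones into blocks that remember their first/last revisions, and have each block point at the pre-change HEAD where the detailed data can still be found.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TokenStats:
    """Hypothetical summary record for one token over a range of revisions."""
    first_rev: str  # oldest revision covered by this record
    last_rev: str   # newest revision covered by this record
    added: int      # occurrences of the token added in that range
    removed: int    # occurrences of the token removed in that range
    # Assumption: aggregated blocks reference the HEAD revision whose summary
    # file still held the detailed records.  None means "not aggregated".
    detail_head: Optional[str] = None


def roll_up(history: List[TokenStats], keep_recent: int, bucket_size: int,
            pre_change_head: str) -> List[TokenStats]:
    """Aggregate all but the newest `keep_recent` records (oldest first)."""
    if keep_recent < 0 or bucket_size < 1:
        raise ValueError("keep_recent must be >= 0 and bucket_size >= 1")
    split = max(len(history) - keep_recent, 0)
    old, recent = history[:split], history[split:]
    rolled = []
    for start in range(0, len(old), bucket_size):
        chunk = old[start:start + bucket_size]
        rolled.append(TokenStats(
            first_rev=chunk[0].first_rev,
            last_rev=chunk[-1].last_rev,
            added=sum(r.added for r in chunk),
            removed=sum(r.removed for r in chunk),
            detail_head=pre_change_head,
        ))
    return rolled + recent


def to_json_lines(records: List[TokenStats]) -> str:
    """Serialize one record per line (newline-delimited JSON)."""
    return "\n".join(json.dumps(asdict(r)) for r in records)
```

For example, calling `roll_up(history, keep_recent=2, bucket_size=2, pre_change_head="r5")` on five per-revision records `r1`..`r5` would leave `r4` and `r5` detailed and fold `r1`..`r3` into two blocks.  The totals are preserved across the roll-up, which is what would keep something like a weekly added/removed chart meaningful for old history.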
