Open Bug 1778802 Opened 3 years ago Updated 2 years ago

Consider storing m-c analysis data in a git repo artifact with a bounded history via `git checkout --orphan` to enable try branch/code review features and recent semantic history support

Categories: Webtools :: Searchfox (enhancement)

Tracking: Not tracked

People: Reporter: asuth; Unassigned

References: Blocks 1 open bug

Details

Currently searchfox only has access to the current indexed revision's analysis data. We could store the analysis data in git and potentially gain some useful abilities.

Implementation Sketch

Major disclaimer: "runaway diffs"

  • In the current analysis rep, record order in the file does not matter, and every record contains a "loc" which identifies its location via line and column numbers. This inherently means that changes which add or remove lines will tend to have cascading impacts on every subsequent analysis record for the following source lines.
  • This could be addressed by creating a https://github.com/mozilla/microannotate/ style dual of the representation where the ordering does matter and the stream needs to be re-synchronized with the underlying token stream for processing (see the sketch after this list).
    • It's currently the case that output formatting requires that we have a tokenizer that can generate a superset of the tokens covered by the analysis records.
    • One could also imagine a rep that still binds the records to position info, but using the path to the tokens in the AST hierarchy as determined by the source of the tokens (ex: clang) or an approximate re-construction via tree-sitter.
      • This is something we might do for the microannotate solution anyway, in order to bind syntax tokens more tightly to the AST, as the diff algorithm can otherwise make semantically incorrect decisions just because it saw a bunch of braces.
      • Because our analysis records include a "contextsym" they effectively already do this additional binding, so it's really a question of getting rid of the "loc".
  • As long as we limit our history period, the effects of the extra noise should not result in major scaling problems and any diff-processing we do can side-step the issue by performing the dual transformation on both revisions of the analysis file and doing the diff itself.
    • This would mainly be desirable to avoid front-loading logistical leg work, and because by doing it lazily we can also emit a ton of optional debug output that would be a major logistical hassle to produce the efficient way.
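
Here is a minimal sketch of that dual transformation, assuming the current rep is one JSON record per line with a "loc" of roughly the form "LINE:COL_START-COL_END"; the field names and loc shape shown are assumptions for illustration rather than a spec:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: convert an analysis file whose records carry a
positional "loc" into a loc-free, order-significant stream so that adding or
removing source lines above a record no longer perturbs every record below it
in a textual diff.  A consumer would re-synchronize this stream against the
underlying token stream, microannotate-style."""

import json
import sys


def parse_loc(loc):
    # Assumed "loc" shape: "LINE:COL_START-COL_END" (columns may be absent).
    line_str, _, cols = loc.partition(":")
    col_start = int(cols.split("-")[0]) if cols else 0
    return (int(line_str), col_start)


def to_dual(analysis_path):
    """Yield records in source order with the positional info removed; the
    ordering of the stream now carries what "loc" used to carry."""
    with open(analysis_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    records.sort(key=lambda rec: parse_loc(rec.get("loc", "0:0")))
    for rec in records:
        rec.pop("loc", None)
        yield rec


if __name__ == "__main__":
    for rec in to_dual(sys.argv[1]):
        # One canonical JSON record per line keeps the rep git-diff friendly.
        print(json.dumps(rec, sort_keys=True))
```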

Searchfox taskcluster analysis job changes:

  • Each searchfox m-c release taskcluster run would:
    • attempt to use the analysis data git repo from the previous (platform) run as the repo for its run.
    • use git checkout --orphan or other tree re-writing to truncate at our analysis data horizon.
      • If we address the "runaway diff" problem we might be able to optionally do some kind of RRD-style decimation, so we'd have every index for the last week, then every other index for the rest of the month, then every 4th index, etc. In theory one could provide linkage to other indexed trees/ESRs, but the further back history goes, the harder it becomes to change our data reps.
    • publish both the repo and the new analysis revision as git bundles as artifacts (see the sketch after this list). The goal here is to ensure that each indexing run can have the full repo to start from, but that anyone who might have the previous day's state can download just the delta.
  • Each searchfox m-c try (AKA non-release and non-dev) taskcluster run would:
    • Use the base m-c indexing run's repo
    • only publish the delta bundle. This is an attempt to keep overall costs down for taskcluster job storage and also so that the work required for searchfox to ingest changes is O(changes) not O(size of the m-c repo).
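
A rough sketch of the per-run git maintenance and bundle publishing is below; the repo path, bundle names, and the 14-day horizon are placeholders rather than the real job configuration:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the per-run repo maintenance described above:
commit this run's analysis, truncate history at the horizon via
`git checkout --orphan`, and publish full + delta bundles as artifacts."""

import subprocess

REPO = "analysis-repo"    # checkout seeded from the previous run's bundle
HISTORY_DAYS = 14         # assumed analysis-data horizon


def git(*args, capture=False):
    return subprocess.run(["git", *args], cwd=REPO, check=True,
                          capture_output=capture, text=True)


def commit_new_analysis(analysis_dir, mc_rev):
    # Replace the tracked tree with this run's analysis output and commit it.
    subprocess.run(["rsync", "-a", "--delete", "--exclude=.git",
                    f"{analysis_dir}/", REPO], check=True)
    git("add", "-A")
    git("commit", "-m", f"analysis for m-c {mc_rev}")


def truncate_history():
    # Re-root the branch at the horizon so the repo never grows unboundedly.
    kept = git("rev-list", "--reverse", f"--since={HISTORY_DAYS} days ago",
               "main", capture=True).stdout.split()
    if not kept:
        return
    oldest = kept[0]
    # Orphan branch rooted at the oldest kept commit's tree, then replay the
    # newer commits on top of it and swap it in for main.
    git("checkout", "--orphan", "truncated", oldest)
    git("commit", "-m", f"history horizon: {HISTORY_DAYS} days")
    if oldest != kept[-1]:
        git("cherry-pick", f"{oldest}..main")
    git("branch", "-f", "main", "HEAD")
    git("checkout", "main")
    git("branch", "-D", "truncated")


def publish_bundles(previous_tip):
    # Full bundle so any consumer can cold-start; delta bundle so anyone who
    # already has the previous day's state only downloads the new commits.
    git("bundle", "create", "../analysis-full.bundle", "--all")
    if previous_tip:
        git("bundle", "create", "../analysis-delta.bundle",
            f"{previous_tip}..main")
```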

Searchfox AWS indexing job changes:

  • Searchfox actually derives an aggregated analysis directory from all of the platforms it downloads, combining them via merge-analyses, so it would also maintain an analysis git repo using effectively the same logic as each of the taskcluster indexing jobs (see the sketch after this list).
    • This would also make a potentially useful place to experiment with the alternate dual rep since it doesn't require landing things in mozilla-central to iterate on.
    • We could also experiment with this being the only place we use a git repo for history; we really only want the taskcluster job changes so that we can efficiently get the analysis deltas. Right now the "target.mozsearch-index.zip" files are ~426M, which is 2 orders of magnitude too large for the web-servers to ingest dynamically and arguably 1 order of magnitude too large for the indexers.
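
As a sketch of how the indexer-side aggregation could feed the repo, assuming merge-analyses accepts per-platform analysis files as positional arguments and writes the merged result to stdout (an assumption about its CLI), and with placeholder directory names:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: merge per-platform analysis files and commit the
aggregate into the same bounded-history analysis repo the taskcluster jobs
maintain.  Directory names and the merge-analyses invocation are assumptions."""

import os
import subprocess

PLATFORM_DIRS = ["analysis-linux64", "analysis-macosx64", "analysis-win64"]
MERGED_REPO = "analysis-repo"


def merge_one(rel_path):
    # Merge every platform's copy of this analysis file into the repo.
    inputs = [os.path.join(d, rel_path) for d in PLATFORM_DIRS
              if os.path.exists(os.path.join(d, rel_path))]
    out_path = os.path.join(MERGED_REPO, rel_path)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "wb") as out:
        subprocess.run(["merge-analyses", *inputs], stdout=out, check=True)


def merge_all(mc_rev):
    # Union of relative analysis paths across all platforms.
    rel_paths = set()
    for d in PLATFORM_DIRS:
        for root, _dirs, files in os.walk(d):
            for name in files:
                rel_paths.add(os.path.relpath(os.path.join(root, name), d))
    for rel_path in sorted(rel_paths):
        merge_one(rel_path)
    subprocess.run(["git", "add", "-A"], cwd=MERGED_REPO, check=True)
    subprocess.run(["git", "commit", "-m", f"merged analysis for m-c {mc_rev}"],
                   cwd=MERGED_REPO, check=True)
```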

Searchfox AWS web-server changes would entail a varying amount of additional processing logic. We would likely leave router.py unchanged, only enhancing pipeline-server.rs and maybe web-server.rs (or maybe we transition some of its jobs to pipeline-server.rs). The scariest part would be any dynamic ingestion of try builds, which would mutate whatever git tree(s) we're using on the fly. This could be mitigated by copying however much of the "mozilla-central" tree is needed over to something like "try-central", which would be the only place we'd do that. We could also avoid dynamic ingestion entirely in favor of just scraping treeherder try jobs for searchfox jobs when running the indexer.

Potential Functionality Enhancements

With this substrate, we can potentially do the following, in order of increasing effort:

  • Semantically highlight recent revisions which we have analysis data for, including both sides of a diff!
  • Efficiently identify the semantic changes between recent m-c states, as well as between try pushes and their base revisions; with additional work this enables us to do a bunch of clever things.
  • Ingest try server runs on-the-fly, since there potentially isn't that much information to ingest and we can speculatively issue GETs against localhost to compel the pre-computation of all impacted files/diffs (which nginx will then cache).
    • This would likely necessitate using a heftier EC2 instance, but the costs are trivial compared to engineer time.
  • Use the diff data to efficiently allow the crossref database to:
    • time travel, providing results from recent revisions
    • show current results as well as "recently removed" results, and label "recently added/changed" results (see the sketch after this list)
  • Perform semantic diffing where we have "structured" analysis records which capture class hierarchies, methods/fields on classes, etc.
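
As a taste of what the diff-driven crossref work could look like, here is a sketch that classifies symbols as recently added or removed between two revisions of a merged analysis file; the "target"/"kind"/"sym" field names are assumptions based on the current analysis rep:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: compute the symbols added/removed between two recent
revisions of an analysis file, as a building block for "recently
added/removed" crossref results and try-push summaries."""

import json
import sys


def target_syms(analysis_path):
    """Collect (kind, sym) pairs from the records that feed crossref."""
    syms = set()
    with open(analysis_path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            # Assumed shape: "target" records carry "kind" and "sym" fields.
            if rec.get("target") and rec.get("sym"):
                syms.add((rec.get("kind", ""), rec["sym"]))
    return syms


def semantic_delta(old_path, new_path):
    old, new = target_syms(old_path), target_syms(new_path)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


if __name__ == "__main__":
    print(json.dumps(semantic_delta(sys.argv[1], sys.argv[2]), indent=2))
```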