Closed Bug 1776522 Opened 7 months ago Closed 7 hours ago

Move from using ANALYSIS_DATA for context menus entirely to SYM_INFO (or new name) and stop generating ANALYSIS_DATA.

Categories

(Webtools :: Searchfox, enhancement)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

References

(Blocks 1 open bug)

Details

I've discussed this in passing several times, but the basic situation is that currently context menus are powered by data-i attributes which provide a numeric index into the ANALYSIS_DATA array we build during HTML generation. Each entry in the ANALYSIS_DATA array is an array tuple of the form:

  1. jumps: An array listing known definitions relevant for the current token position of the form { sym, pretty, path } where path also includes a line number anchor on the end for the line containing the definition. Multiple definitions can be associated with a token resulting in multiple "go to def of" menu items. We now use direct links since a recent enhancement, but previously we bounced through the "define" CGI endpoint but that causes spurious page reloads for in-page navigations.
  2. searches: An array listing known symbols relevant for the current token position of the form { sym, pretty }. These always present as searches in the menu.

ANALYSIS_DATA is incredibly redundant. No attempt is made to de-duplicate entries in the array. There can be some level of variation between entries for the same token/symbol, but these amount to:

  • We don't emit a jump/definition on the line that we know is where the definition is, because there's no point going to the line you're already on.
  • Sometimes a token like a constructor ends up with a whole bunch of other stuff mis-attributed to that token, like implicit constructors.

Since the landing of the structured record support we actually do have a SYM_INFO dict already providing { syntax, type, typesym } meta-information for locals which are not available in the crossref database as an artifact of the ReactJS-based prototype fancy UI which will never land.

The plan here is:

  • Store all the information ANALYSIS_DATA currently provides in the SYM_INFO per-symbol map.
  • Populate the context menu by walking the list of data-symbols (which we already populate for use by our hover mechanism and the ReactJS UI knew how to consume) and lookups against SYM_INFO.
  • SYM_INFO is also able to store additional information like the structured record information for a symbol so that this can be used to show cool information in the super sidebar, or bootstrap "query" requests that the super sidebar issues.
    • We are not including the complete crossref information for a symbol, as this can be quite voluminous and we have no intent to surface the results in the current page. The idea is to have the structured info, the "pretty" independently for fallback when there's no structured info, plus any singleton def/decl/idl hits.
  • The SYM_INFO rep also allows for the "query" mechanism's dynamically generated graphs to have an identical interaction pattern.

New features that will be added during this process and will constitute a change in behavior:

  • We are no longer limited to a single "jump" per symbol which means that we can provide all of:
    • Go to IDL (currently this supersedes go to def)
    • Go to definition
    • Go to declaration. I've found this useful for C++ where I'd like to go to the header where the comments are (in the Fancy branch prototype UI). I could see including some kind of iconic badge on the decl (and def too) if we know there's a block comment associated with them, which would be the case if we have a peekRange with a first line before the canonical line of the def.

It's unclear what the size trade-off will be here, as there should be much less duplication, but there will be a net increase in data stored.

I may have discussed this previously in the context of also changing our blame lookups to avoid network requests in favor of having that data stored in the document as well in a JSON blob or templates or something, but that change won't happen as part of this change.

I've got the core of this implemented, but have run into bug 1727789 adjacent issues where the cleverness we added in bug 1759064 to add multiple "source" records knowing that they would translate directly into "search" directives in the ANALYSIS_DATA [jumps, searches] model is now causing problems.

My tentative plan is to:

  • Have idl-analyze.py emit a "structured" record.
  • Generalize the "structured" analysis record's "srcsym"/"targetsym" model to support the idea of named "slots" (or "relations"/"rel"s?) like I propose in bug 1727789 under the "IDL analyzer simplification" sub-heading.
  • The structured record will also identify the underlying language/scheme we're using, and for idl-analyze.py that will be "xpidl" which will mark a little bit of a transition for us as right now we do tend to refer to XPIDL as "idl" but I think we need to try and move away from that because it's confusing and we can benefit from having a more generic concept of IDL which isn't conflated with XPIDL.
  • context-menu.js for now will have all the UX smarts about how to make use of those slots to build the searches. The rationale is that:
    • This is the most obvious and easiest way for (potential) contributors to change searchfox's behavior.
    • The alternate approach would be to define some kind of JSON or TOML schema to map these and have the crossref process apply the schema to the slots when building the database, but unless the server needs to uniformly apply the same logic, this feels more like obfuscation than being helpful.
      • Right now the server doesn't need to know these things because the "search" endpoint just takes the list of symbols to search on and loses the semantic concept that the menu was getting at. We can change this in the future with the "query" endpoint because it seems like most people don't get what's going on with the resulting "search" spec and how it can be manipulated, but that's further down the road when the "query" endpoint becomes usable.
Blocks: 1800008

I'm finishing this up now after having landed a bunch of important precursor work like the settings page which in turn wanted more rust-based templating, etc.. A notable change I had to make was to introduce a new "jumpref"/"jumpref-extra" pair using the crossref format.

My first attempt at this was to have output-file just directly perform crossref lookups whenever it found a symbol in a file that it had not seen yet. The crossref entry would then be passed through a transformation function (convert_crossref_value_to_sym_info_rep) and saved into the map which will be written into the HTML file. (And that symbol would never be looked up again for the file being formatted, but would be for other files in that run.)

This ended up insanely regressing performance. Like, for dom/indexedDB/ActorsParent.cpp (a very large C++ file) there was (using only 1 data point for each measurement) a 218x regression on the formatting stage of output-file. So I modified crossref.rs to also run the transformation function on the crossref data as we wrote it to the crossref file, building a parallel jumpref fiile. Here's what the file sizes end up looking like:

ubuntu@ip-IP:~/index/mozilla-central$ du -csh *ref*
2.3G    crossref
4.3G    crossref-extra
927M    jumpref
317M    jumpref-extra
7.8G    total

The reason the jumpref files aren't more trivial in size is that the contents are:

  • sym: The symbol!
  • pretty: The pretty identifier for the symbol.
  • jumps: A set of up to 3 locations for idl/def/decl (small)
  • meta: The structured information payload for the symbol plus derived information. This can get large for symbols like (pretty) mozilla::ipc::IProtocol::AllManagedActors where there are a ton of overrides, or nsISupports where there are a ton of subclasses. For now I'm going to err on the side of keeping this data since it potentially allows for a bunch of client-side prototyping and maybe even it makes sense for a bunch of widgets or functionality to run client-side without having to talk to the server[1]. The change doesn't actually bloat the rendered HTML (with the baked-in JSON data as JS), with dom/indexedDB/ActorsParent.cpp.gz going from 1.4M on trunk to 1.3M with the patch. This makes sense because as comment 0 points out, the ANALYSIS_DATA array was incredibly redundant and was O(symbol-referencing tokens) whereas the new SYM_INFO is O(symbols referenced in the source).

There are also situations where there are just very long symbol names that can push us over the threshold into "-extra" gifted to us by rlbox in particular. Like:

w2c_std____2__pair_std____2____tree_iterator_std____2____value_type_unsigned_int__unsigned_int___std____2____tree_node_std____2____value_type_unsigned_int__unsigned_int___void_____long___bool__std____2____tree_std____2____value_type_unsigned_int__unsigned_int___std____2____map_value_compare_unsigned_int__std____2____value_type_unsigned_int__unsigned_int___std____2__less_unsigned_int___true___std____2__allocator_std____2____value_type_unsigned_int__unsigned_int_______emplace_unique_key_args_unsigned_int__std____2__piecewise_construct_t_const___std____2__tuple_unsigned_int_const____std____2__tuple____unsigned_int_const___std____2__piecewise_construct_t_const___std____2__tuple_unsigned_int_const______std____2__tuple____

The move to jumpref netted us a speed-up, taking 98.9% of the current trunk run. But again, these are individual data points. A better metric is that output.sh now seems to take about 30 seconds longer, up to 6m31s from 5m56s, so this potentially makes us slower, but not by much. This is an acceptable performance regression for our purposes, especially because in bug 1781179 :botond wants to do some semantic highlighting which needs something like a crossref lookup. And even if we do move to using tree-sitter as a tokenizer in bug 1796870 which potentially would let us get syntax highlighting from just the parse tree without needing the clang details, there are things like inlay hints about the types being used that do in fact want crossref-like lookup.

1: In general my plan has been to bias towards to try and do everything server-side, even the widgets, because we want to avoid ending up in a situation where searchfox loads 10 MiB of JS at startup and you end up looking at serious latency to get results. Having searchfox do things server-side in a way that can be cached (and pre-cached) has a lot of upsides. (Even in cases where a user is network limited, the fancy branch prototype showed that spidering symbols via separate requests would be the real latency problem.) And if the canonical way to do things is in rust, then we run into consistency problems if we also have a JS implementation, etc. etc. But perhaps we can strike a balance with widgets.

Note that the choice of the liquid templating engine was so that we could use it both on the client and the server, but I wasn't envisioning using it for something so complex. It's definitely been an improvement on doing all the HTML generation in rust, but the ergonomics aren't good enough for something more generic. Probably better to use an embedded JS runtime. Although maybe if we come up with some good web components with rules that keep their behavior simple and fast, the answer can just be client-side web components.

This landed and stuck.

Status: ASSIGNED → RESOLVED
Closed: 7 hours ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.