Open Bug 1697671 · Opened 3 years ago · Updated 1 year ago

Link resource:// and chrome:// URLs in source listings to the underlying source files

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

References

Details

Archiving an enhancement proposal by :nika in #searchfox on Matrix today (first message link): we should "make the resource:// and chrome:// URLs in our js/c++ code link to the relevant source file in searchfox".

The domain synopsis is:

  • The information is already made available via the chrome-map.json artifact built by code coverage.
    • The relevant routes are:
      • gecko.v2.mozilla-central.latest.firefox.linux64-ccov-opt
      • gecko.v2.mozilla-central.revision.REVISION.firefox.linux64-ccov-opt
    • You can find the jobs by filtering on "-ccov", e.g. for linux64.
    • Nika's analysis of the data is: "It looks like [2] has a series of paths which were copied into dist, and everything under dist/bin/ is in GreD, and everything under dist/bin/browser/ is under XCurProcD (IIRC)"
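
To make that concrete, here is a minimal Rust sketch of the lookup this data enables, assuming we've flattened chrome-map.json's copied-path info into a dist-path-to-source-path map; the map and function names are hypothetical, and the prefix handling just encodes Nika's observation above rather than the actual artifact schema:

    use std::collections::HashMap;

    // Hypothetical resolver over chrome-map.json's copied-path data.
    // dist/bin/ is GreD and dist/bin/browser/ is XCurProcD, per the
    // analysis quoted above.
    fn resolve_resource_url(
        url: &str,
        dist_to_source: &HashMap<String, String>, // dist-relative path -> source path
    ) -> Option<String> {
        let dist_path = if let Some(rest) = url.strip_prefix("resource://gre/") {
            // resource://gre/ is GreD, i.e. dist/bin/.
            format!("dist/bin/{}", rest)
        } else if let Some(rest) = url.strip_prefix("resource:///") {
            // Bare resource:/// is app-relative, i.e. dist/bin/browser/.
            format!("dist/bin/browser/{}", rest)
        } else {
            return None;
        };
        dist_to_source.get(&dist_path).cloned()
    }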

In terms of implementing this in searchfox, the current situation is that all of our linkification logic happens very late in the pipeline, at the output formatting stage, for text/comment/string tokens: the link logic in links.rs handles general link detection via the linkify crate, plus hard-coded Bugzilla/servo/WPT PR linkification magic.

That's actually quite an easy place to put the logic; the problem is that it completely bypasses the whole analysis infrastructure that feeds into cross-referencing and such. What we really want is something consistent with :kats' enhancements in bug 1418001 that produces analysis records like {"loc":"00007:9-26","target":1,"kind":"use","pretty":"dom/workers/WorkerPrivate.h","sym":"FILE_dom/workers/WorkerPrivate@2Eh"} that explicitly match up with the string token and have an explicit symbol identifier that gets created and cross-referenced.
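
For illustration, here's a quick Rust sketch of emitting that record shape; the field names and the FILE_ symbol mangling ('.' escaped as @2E) are copied from the quoted example, but don't treat this as the authoritative schema:

    use serde::Serialize;

    // Sketch of the record shape quoted above; illustrative, not the
    // authoritative searchfox analysis schema.
    #[derive(Serialize)]
    struct FileTargetRecord {
        loc: String,    // e.g. "00007:9-26"
        target: u8,     // 1 marks a target record
        kind: String,   // e.g. "use"
        pretty: String, // e.g. "dom/workers/WorkerPrivate.h"
        sym: String,    // e.g. "FILE_dom/workers/WorkerPrivate@2Eh"
    }

    fn file_use_record(loc: &str, path: &str) -> FileTargetRecord {
        FileTargetRecord {
            loc: loc.to_string(),
            target: 1,
            kind: "use".to_string(),
            pretty: path.to_string(),
            // '.' is 0x2E, hence the @2E escaping seen in the example.
            sym: format!("FILE_{}", path.replace('.', "@2E")),
        }
    }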

In a fully normalized pipeline, the way that might work is:

  • We teach fetch-tc-artifacts to download chrome-map.json.
  • We teach derive-per-file-info.rs to ingest chrome-map.json and populate concise-per-file-info.json with the info necessary to map the URLs to the source files.
  • Indexers like the C++ indexer (written in C++) and the JS indexer/analyzer (itself written in JS) emit analysis records for strings, with only the JS indexer doing so to start.
    • This might also possibly include comments, or tokens in comments that appear to match some interesting-ness heuristic, like being enclosed in backticks or being camel-cased or having "::" inside them, etc. In the C++ case, clang actually fully understands doxygen syntax so there's a ton of low-hanging fruit if one were so interested.
  • A meta-indexer written in Rust (so we can reuse the linkify logic and generally only write the logic once) processes these strings and converts them into richer symbol references, re-writing the analysis files in the process (a sketch of this pass follows this list).
    • For an initial implementation, this could be run against analysis files as we read them as part of the cross-referencing process (but re-writing them if any links are encountered). The restriction would be that the only data that could be looked up would be from concise-per-file-info.json, which would work for this scenario. It would not work for things like ClassName which would depend on the normal cross-referencing process having already been completed.
      • One could obviously do something involving futures/to-do lists, but that still doesn't seem like something to do in the MVP.
    • This would involve extracting some of the concise-file-info support logic in output-file.rs out so it can be reused.
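
Here's the sketch of the meta-indexer rewrite pass promised above, under the assumption that string tokens come in as (loc, text) pairs pulled from the analysis records and that the URL-to-file map was populated from concise-per-file-info.json; all the names here are made up:

    use regex::Regex;
    use std::collections::HashMap;

    // Minimal sketch of the meta-indexer rewrite pass; hypothetical names.
    fn augment_string_tokens(
        tokens: &[(String, String)],           // (loc, string literal text)
        url_to_file: &HashMap<String, String>, // URL -> source path
    ) -> Vec<String> {
        let url_re = Regex::new(r"(?:resource|chrome)://[A-Za-z0-9_./-]+").unwrap();
        let mut extra_records = Vec::new();
        for (loc, text) in tokens {
            for m in url_re.find_iter(text) {
                if let Some(path) = url_to_file.get(m.as_str()) {
                    // Emit a FILE_ symbol reference in the record shape shown
                    // earlier so cross-referencing picks it up.
                    extra_records.push(format!(
                        r#"{{"loc":"{loc}","target":1,"kind":"use","pretty":"{path}","sym":"FILE_{}"}}"#,
                        path.replace('.', "@2E")
                    ));
                }
            }
        }
        extra_records
    }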

An evolutionary step towards this would be to do just the first two steps, getting the info into concise-per-file-info.json, and then propagate that to the existing late-linkifying formatting output logic.

> This might also possibly include comments, or tokens in comments that appear to match some interesting-ness heuristic, like being enclosed in backticks or being camel-cased or having "::" inside them, etc. In the C++ case, clang actually fully understands doxygen syntax so there's a ton of low-hanging fruit if one were so interested.

FWIW I think any initial fix to this issue probably shouldn't try to search for other identifiers in comments, and should just focus on processing out URIs from comment and string tokens, as that's already fairly well defined. The more general case of generically handling interesting tokens in comments is definitely interesting, but it's much more language-specific than the URL case, which is pretty firmly language-agnostic (hence why we previously did it in the last phase).

I think I might want to do the meta-indexer which extracts out links fully generically, so that it can operate across all files like the linker does today in the final step, and just augments the analysis files with extra information when it finds resource:// etc. URIs. That way we don't need to have support added to every language, and we just get the behaviour by default.
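
If we do reuse the linkify crate for that generic pass, the extraction core could be as small as the following; note that whether linkify recognizes resource://chrome:// schemes out of the box is something to verify, and if not, a small custom scanner like the regex sketch above would cover those:

    use linkify::{LinkFinder, LinkKind};

    // Generic link extraction reusing the linkify crate that links.rs
    // already depends on for general link detection.
    fn find_links(text: &str) -> Vec<(usize, usize, String)> {
        let mut finder = LinkFinder::new();
        finder.kinds(&[LinkKind::Url]);
        finder
            .links(text)
            .map(|link| (link.start(), link.end(), link.as_str().to_string()))
            .collect()
    }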

See Also: → 1763532

Due to some yak-shaving that has snowballed, I think I can address the root resource/chrome URL linkification feature largely as proposed by Nika in comment 1 by:

  • Continuing to perform linkification of comments at output time, informed by the tokenizer.
  • Also performing a variant of linkification on string literals which did not have an associated source record. (For C++ #include directives we will have emitted a source record for the included file, and we will defer to that.)
  • Explicitly having a mechanism in place that knows how to apply tree-defined transforms or lookups.
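
One hypothetical shape for that tree-defined transform/lookup mechanism, purely illustrative rather than a committed design:

    use regex::Regex;
    use std::collections::HashMap;

    // Each tree's config contributes either exact-URL lookups (e.g. the
    // chrome-map.json derived aliases) or pattern-based transforms.
    enum UrlRule {
        Lookup(HashMap<String, String>),                // exact URL -> source path
        Transform { pattern: Regex, template: String }, // rewrite via regex
    }

    fn apply_rules(rules: &[UrlRule], url: &str) -> Option<String> {
        for rule in rules {
            match rule {
                UrlRule::Lookup(map) => {
                    if let Some(path) = map.get(url) {
                        return Some(path.clone());
                    }
                }
                UrlRule::Transform { pattern, template } => {
                    if pattern.is_match(url) {
                        return Some(pattern.replace(url, template.as_str()).into_owned());
                    }
                }
            }
        }
        None
    }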

This will explicitly NOT:

  • Provide full semantic cross-referencing of resource/chrome URLs. This is just links! But one can absolutely search for those links via our full-text search and it should be fine. And maybe we can hook this up to trigger the context menu in the short-term before people's muscle memory solidifies.
  • Handle URLs that do not appear in their entirety in the source. Clever C code doing things like leveraging string-literal concatenation so that "resource://" "blah" ends up as "resource://blah" will not be linkified! This also means that clever preprocessor stuff won't work. These could be made to work in the future once we get the analysis pass emitting the literals (which we will probably need for bug 1779340).
  • Provide for configurable alternate regex-based URL transforms like our hardcoded linkify_bug_numbers, but this is a clear next step to allow trees like wubkat to define a non-mozilla-centric mapping. We could also imagine URL-transform-based mechanisms to try to provide support for semantic magic on deep links in future work as well.

In my current yak-shaving, derive-per-file-info.rs and its very hand-rolled Rust code for file ingestion are being replaced with a more generic mechanism that consumes a .toml config file specifying how to extract info from JSON files and map it into "concise" and "detailed" per-file information. We can add a url_aliases mechanism plus the required glue code[1] for chrome-map.json to help populate this. The lookup map can then use url_aliases, and we should also be able to show the aliases in the source listing for a file, which might be useful context, especially if we linkify each alias into a full-text search for the given string, making it look like we actually have some kind of semantic support.

1: For the JSON ingestion, I've defined a bunch of lightly configurable named ingestion helpers, and chrome-map.json will definitely get its own in the same way, so maybe it could be used for other purposes; the goal is not to reinvent jq or XSLT or anything, though. Especially since trees can always add their own jq script invocations in their setup or build scripts, one needn't modify the Rust code as long as the mapping can be translated into any of the multiple idioms now supported. (I've opted to avoid the jq transforms for m-c because their readability tends not to be great and we have less ability to instrument the data flow for explainability.)
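
For the curious, a rough Rust sketch of what the url_aliases shapes might look like; every name here is a guess at the eventual schema, not the real config:

    use serde::Deserialize;
    use std::collections::HashMap;

    // Hypothetical serde types for a url_aliases ingestion entry in the
    // .toml config; field names are illustrative only.
    #[derive(Deserialize)]
    struct UrlAliasesIngestion {
        // Which downloaded artifact to read, e.g. "chrome-map.json".
        source_file: String,
        // Which named ingestion helper to apply to it.
        helper: String,
    }

    // What gets folded into concise-per-file-info.json: for each source
    // file, the resource:///chrome:// URLs it is packaged under.
    type UrlAliases = HashMap<String, Vec<String>>;

    // Usage sketch: toml::from_str::<UrlAliasesIngestion>(config_text)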

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Assignee: bugmail → nobody
Status: ASSIGNED → NEW