Open Bug 1781179 Opened 2 years ago Updated 11 months ago

Improve Searchfox's C++ syntax/semantic highlighting

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

People

(Reporter: botond, Assigned: botond)

Details

Searchfox today highlights C++ code by classifying tokens into a relatively small number of categories (e.g. syn_type, syn_def, syn_string, syn_reserved, syn_comment, probably a few others I'm missing).

I would like to propose expanding this to a richer set of semantic token kinds (and possibly token modifiers), to make C++ code easier to read and understand visually.

As a point of comparison, and a potential target to aim for, here are clangd's token kinds and token modifiers, which are themselves based on (and lightly extend) the ones specified in the Language Server Protocol.
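To make the proposal a little more concrete, here is a minimal sketch of what a richer token model could look like. The kinds are a small subset loosely modeled on the LSP/clangd names; the `TokenKind` and `TokenModifiers` types, the `css_classes` helper, and the `syn_kind_*` / `syn_mod_*` class names are all hypothetical and do not exist in Searchfox today.

```rust
/// Illustrative subset of semantic token kinds, loosely following the
/// LSP/clangd naming. Hypothetical: not Searchfox's actual data model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Namespace,
    Class,
    Enum,
    EnumMember,
    Function,
    Method,
    Parameter,
    LocalVariable,
    Field,
    Macro,
}

/// Token modifiers are independent flags that can combine with any kind.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TokenModifiers {
    pub declaration: bool,
    pub readonly: bool,
    pub is_static: bool,
}

/// Build a space-separated CSS class list: one class for the kind,
/// followed by one class per modifier that is set.
/// All class names here are made up for illustration.
pub fn css_classes(kind: TokenKind, mods: TokenModifiers) -> String {
    let kind_class = match kind {
        TokenKind::Namespace => "syn_kind_namespace",
        TokenKind::Class => "syn_kind_class",
        TokenKind::Enum => "syn_kind_enum",
        TokenKind::EnumMember => "syn_kind_enum_member",
        TokenKind::Function => "syn_kind_function",
        TokenKind::Method => "syn_kind_method",
        TokenKind::Parameter => "syn_kind_parameter",
        TokenKind::LocalVariable => "syn_kind_local_variable",
        TokenKind::Field => "syn_kind_field",
        TokenKind::Macro => "syn_kind_macro",
    };
    let mut classes = vec![kind_class];
    if mods.declaration {
        classes.push("syn_mod_declaration");
    }
    if mods.readonly {
        classes.push("syn_mod_readonly");
    }
    if mods.is_static {
        classes.push("syn_mod_static");
    }
    classes.join(" ")
}
```

Keeping kinds and modifiers separate means a theme could style, say, all static members consistently without needing a dedicated class for every kind/modifier combination.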

As a major fan of syntax highlighting, I am very on board with this! We should probably also file a bug about how we plan to actually leverage this styling. During the dark theme work, an initial idea was that we could just borrow devtools' themes as a basis, but if this moves us into the territory of adopting VS Code themes or needing to come up with our own more extensive themes for the extra token types, we should probably start thinking about that. I am planning to add a static settings page soon, so we can probably risk growing into a feature matrix that's the Cartesian product of ["light", "dark"] and ["Just what I'm used to, thanks!", "All the colors!"], but we probably don't want to go much beyond that.

An interesting design question here is how much Searchfox should emit into the "syntax" field of the analysis "source" records, and how much we should instead derive from the "structured" records, which provide canonical information about symbols and would benefit from any global analyses such as propagation of MOZ_CAN_RUN_SCRIPT. We have not yet adapted the style-choosing logic to leverage this for C++, but we could. (That logic would also need to be updated for any changes we make here anyway.)

The structured record formal hierarchy section tries to capture everything that can be emitted, but the implementations are straightforward: see emitStructuredInfo for RecordDecl, FunctionDecl, and FieldDecl, and their shared call site.

One general argument in favor of relying on the "structured" records is analysis file size. Right now nsGlobalWindowInner.cpp is 7,989 lines and has a 4.6M analysis file against a source file size of 264K, so the analysis is roughly 17 times the size of the source. While the size isn't a huge issue in itself, it's potentially important for a possible magic feature where we map from fulltext regexp searches back to the underlying semantic tokens. There, less analysis data to process is arguably better, at least until we change how we store the compressed analysis files to allow line-centric random access.

Assignee: nobody → botond

As a quick update in terms of implementation notes and follow-up to comment 1:

  • I did land the static settings page. I'm of the general mind that it's nice to be able to experiment in production and let people opt in to that experience and the churn it entails, but not to compel it. I also think feature matrices that introduce permutations can be a real maintenance problem and a drag on enthusiasm for making enhancements. Which is to say:
    • First, Emilio currently should have the authoritative say on any of this :)
    • While I think it would be nice to have for people to be able to choose between levels of theming / gaudiness long-term, I don't think it's essential. The settings page can support user settings for that, presumably by having a setting control an attribute in the DOM somewhere.
    • I do think it would be nice to use the settings page to initially feature-gate any changes in syntax highlighting behind an alpha quality bar (with the setting controlling a DOM attribute). This reduces the churn that people might experience.
  • In https://github.com/mozsearch/mozsearch/pull/603, as part of bug 1776522, I introduced the structured "jumpref" mechanism, which provides us with metadata about symbols during the HTML formatting process. This means we don't have to base our rendering decisions solely on what's in the "source" record; we can also rely on what's in the "structured" record.
    • Here's where the lookup happens in format.rs.
    • crossref_converter.rs is where we distill the data from the full crossref representation which includes any "structured" record plus all definitions, declarations, uses, etc.
    • It's also fine to just add more data to the "source" record "syntax" field, but I just want to make sure you're aware the structured data can be available via the jumpref, etc.
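
As a rough illustration of the two data sources discussed above, here is a minimal sketch of a style chooser that always uses the "source" record's "syntax" field and additionally uses structured information when a lookup found some. The `TokenInfo` struct, the `style_classes` function, the comma-separated shape assumed for the "syntax" value, and the `syn_kind_*` class names are assumptions made for this example; this is not the actual code in format.rs or crossref_converter.rs.

```rust
/// Hypothetical, simplified view of what the formatter knows about a token.
pub struct TokenInfo<'a> {
    /// The "syntax" field from the "source" record, assumed here to be a
    /// comma-separated list such as "def,function".
    pub source_syntax: &'a str,
    /// Symbol kind distilled from the "structured" record, if a
    /// jumpref-style lookup found one for this token's symbol.
    pub structured_kind: Option<&'a str>,
}

/// Derive CSS classes for a token. Classes from the "source" record come
/// first; a kind class from the "structured" record is appended when
/// available, so the richer data refines rather than replaces the old one.
pub fn style_classes(tok: &TokenInfo<'_>) -> Vec<String> {
    let mut classes: Vec<String> = tok
        .source_syntax
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| format!("syn_{}", s))
        .collect();
    if let Some(kind) = tok.structured_kind {
        classes.push(format!("syn_kind_{}", kind));
    }
    classes
}
```

With this split, anything that only the "structured" record knows (for example, results of global analyses) would not need to be duplicated into every "source" record to influence styling.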