Open Bug 1796870 Opened 3 years ago Updated 6 months ago

Support using tree-sitter as a tokenizer and source of nesting data

Categories

(Webtools :: Searchfox, enhancement)

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

Currently we have custom in-tree tokenizers for CSS, C-like languages, and tag-like languages. (Ironically, Rust counts as a C-like language! :) tree-sitter is a parser that is already used by https://github.com/mozilla/rust-code-analysis/ to power various Mozilla machinery like https://github.com/mozilla/bugbug that tries to figure out which functions were changed by patches, etc. There's even a custom tree-sitter grammar dialect for C++ that handles Mozilla-specific macrology and similar. There's also a tree-sitter playground that can be used to show the resulting AST/parse tree for a variety of languages.

Switching to tree-sitter could let us have a potentially more resilient tokenizer with more language support than we can maintain ourselves, while also potentially allowing us to:

  • Derive our position: sticky nesting from tree-sitter rather than depending on the semantic analysis to provide it. This would allow position: sticky nesting across more languages and on non-trunk revisions of files. The original choice to do position: sticky via the analysis pass was made for simplicity and because we already had an AST, but it is not strictly superior.
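
A minimal sketch of what deriving sticky nesting from a parse tree could look like. Everything here is an assumption for illustration: `Node` is a hypothetical stand-in for a parse-tree node (it is not the tree-sitter API), `NestingRange` and the container kinds in `is_container` are made up, and the real grammar node kinds would differ per language.

```rust
// Hypothetical stand-in for a parse-tree node; this is NOT the tree-sitter API.
#[derive(Debug)]
struct Node {
    kind: &'static str,
    start_line: usize, // 1-based, inclusive
    end_line: usize,   // 1-based, inclusive
    children: Vec<Node>,
}

// One sticky-header candidate: the line that opens a block and the last line it covers.
#[derive(Debug, PartialEq)]
struct NestingRange {
    depth: usize,
    header_line: usize,
    last_line: usize,
}

// Assumed set of node kinds that deserve a sticky header.
fn is_container(kind: &str) -> bool {
    matches!(kind, "namespace" | "class" | "function")
}

// Walk the tree depth-first and collect multi-line containers in source order.
fn collect_nesting(node: &Node, depth: usize, out: &mut Vec<NestingRange>) {
    let mut child_depth = depth;
    if is_container(node.kind) && node.end_line > node.start_line {
        out.push(NestingRange {
            depth,
            header_line: node.start_line,
            last_line: node.end_line,
        });
        // Only containers that produce a header increase the nesting depth.
        child_depth = depth + 1;
    }
    for child in &node.children {
        collect_nesting(child, child_depth, out);
    }
}

fn nesting_ranges(root: &Node) -> Vec<NestingRange> {
    let mut out = Vec::new();
    collect_nesting(root, 0, &mut out);
    out
}

// A tiny hand-built tree: a class spanning lines 2-9 that holds one
// multi-line function (lines 3-6) and one single-line function (line 7).
fn example_tree() -> Node {
    Node {
        kind: "file",
        start_line: 1,
        end_line: 10,
        children: vec![Node {
            kind: "class",
            start_line: 2,
            end_line: 9,
            children: vec![
                Node { kind: "function", start_line: 3, end_line: 6, children: vec![] },
                Node { kind: "function", start_line: 7, end_line: 7, children: vec![] },
            ],
        }],
    }
}
```

Calling `nesting_ranges(&example_tree())` yields one range for the class at depth 0 and one for the multi-line function at depth 1; the single-line function is skipped because there is nothing to stick to. Since this only needs the parse tree, it would work the same way on any revision of a file.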
See Also: → 1800016
Blocks: 1800840

In terms of what's in-tree now:

  • As part of my enhancements to teach scip-indexer about nesting, it learned how to use tree-sitter to provide the sticky information. However, as noted in comment 0, this was done at the analysis data abstraction level.
  • The cst_tokenizer.rs in the WIP hyperblame support operates at the tokenizer level. While it will initially be used to derive the traditional line-centric blame view from the underlying token-centric representation, a follow-up could evolve things so that we use the tree-sitter tokenizer for all tokenization purposes (and still join on the semantic data).
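
To illustrate the line-centric view derived from a token-centric representation, here is a rough sketch. It is not taken from cst_tokenizer.rs: the `TokenBlame` struct, its fields, and the "newest revision on the line wins" rule are all assumptions made for the example, and the actual hyperblame logic may pick a line's revision differently.

```rust
use std::collections::BTreeMap;

// Hypothetical token-level blame record; field names are illustrative only.
struct TokenBlame {
    line: usize,       // 1-based line the token starts on
    rev_order: u32,    // larger means a more recent revision
    rev: &'static str, // revision identifier
}

// Collapse token blame to one revision per line, assuming the newest token wins.
fn line_blame(tokens: &[TokenBlame]) -> BTreeMap<usize, &'static str> {
    let mut newest: BTreeMap<usize, (u32, &'static str)> = BTreeMap::new();
    for t in tokens {
        // Seed the line with the first token seen, then keep the most recent one.
        let entry = newest.entry(t.line).or_insert((t.rev_order, t.rev));
        if t.rev_order > entry.0 {
            *entry = (t.rev_order, t.rev);
        }
    }
    // Drop the ordering key; callers only need line -> revision.
    newest.into_iter().map(|(line, (_, rev))| (line, rev)).collect()
}
```

With tokens on line 1 blamed to an older and a newer revision, `line_blame` reports the newer one for line 1, which matches how a traditional blame view attributes a line to the last change that touched it.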
Blocks: 1879185
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
See Also: → 1998712