Closed Bug 1762817 Opened 2 years ago Closed 2 years ago

Proposal/plan: Introduce new "query" search endpoint supporting per-tree "presets" and based on extensible "term:value" syntax with pipes

Categories

(Webtools :: Searchfox, enhancement)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Attachments

(2 files)

This bug is intended to serve as a proposal and a place for discussion about the direction of the searchfox query syntax for the new rust web server as we begin to implement new features. I've previously strewn thoughts around various bugs and commit messages; this bug is an attempt on my part to describe my current thinking and plans in one place. The most notable location for my prior thoughts on syntax was the discussion prior to the creation of searchfox-tool at https://bugzilla.mozilla.org/show_bug.cgi?id=1707282#c0, but that syntax was intended primarily for automated testing, which is a distinct use-case, and it has proven entirely unsuitable as a user-facing syntax.

Important Note! Searchfox's architecture means that we can continue to operate the existing web-servers (router.py and web-server.rs) and maintain their UX effectively indefinitely at marginal resource cost. There will be no need for any abrupt transitions (but they also won't magically gain new features).

Existing Searchfox "search"

Syntax

Queries are expressed over 4 GET parameters:

  • q: The query string.
  • case: A boolean-ish value controlling case-sensitivity. If omitted, or if the value is anything but "true", the search is case-insensitive; if present with the value "true", it is case-sensitive.
  • regexp: A boolean-ish value corresponding to whether q should be interpreted as a regular expression. In this case we only perform a "Textual Occurrences" livegrep/codesearch query. Note that while codesearch supports matching paths and we provide path information to it, we never process the file_results, and we don't run our own grep-based search_files, so this is purely a file contents search when regexps are enabled.
  • path: A very capable path-globbing mechanism with non-intuitive/non-obvious implications:
    • If q included a symbol:FOO or id:FOO, the pathre is used to filter semantic results.
    • However, if q was just FOO, then the default handler will skip searching files, will skip performing an identifier search, and will only do a livegrep/codesearch for "Textual Occurrences".
    • If there was no q payload at all, the path will only be applied against the list of files using search_files.

The q query syntax supports "term:value" syntax, with the following terms supported, many of which are effectively undocumented/secret; the root https://searchfox.org/ page at https://github.com/mozsearch/mozsearch-mozilla/blob/master/help.html only documents the separate path filter UI:

  • path: - Handled exactly the same as the path GET parameter from the separate path filter UI box. Globbing syntax is transformed into a regular expression. This option will be clobbered by the path GET parameter if it is present.
  • pathre: - Allows a path regular expression to be passed as-is without any transformations.
  • context: - Secret mechanism for having livegrep/codesearch provide us with 0-10 lines of context for "Textual Occurrences". This is explicitly still a secret because it doesn't work for semantic results yet, although it's my explicit goal to support context everywhere for this bug.
  • symbol: - If present we only perform a symbol search. '.' is normalized to '#', which matters for our encoding of JS symbols. The parsing logic here is very questionable (broken) if any other terms come after symbol:, but our convention is to use comma-delimited symbols and that's how it's processed, so it probably doesn't come up much. Presumably replacing the ' ' join with ',' and breaking out of the loop would accomplish what the logic was trying to do.
  • re: - Treats the remainder of the query string as a regexp, like if the regexp GET parameter had been passed. Using this notably allows also specifying a context: as compared with using the checkbox.
  • text: - Treats the remainder of the query string as an exact match text-string which is accomplished by regexp-escaping the string and then treating it as a regexp. This forces livegrep/codesearch-only as explained above.
  • id: - Performs an exact-match identifier search. This is notably different from the default case identifier search which is only a prefix match. Search parsing continues after this, although it's not clear that this enables any additional functionality.
  • DEFAULT - This is what happens if we see a token that doesn't meet any of the above. We join the rest of the query string with spaces, regexp-escape it, and put it in default which provides for the combination of a livegrep/codesearch "Textual Occurrences" search, plus if a pathre was not provided via path GET parameter or path: or pathre: query syntax, a file search via search_files and a prefix-based identifier_search are performed.
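To make the dispatch above concrete, here is a minimal sketch in Python (not searchfox's actual router.py logic) of splitting a q string into "term:value" pairs, including the remainder-consuming behavior of re:/text: and the regexp-escaped DEFAULT fallthrough. The KNOWN_TERMS set and the function name are invented for illustration and the sketch ignores the comma-delimited symbol: convention.

```python
# Simplified, illustrative "term:value" query splitter; this is NOT the
# real searchfox parser, just a sketch of the dispatch described above.
import re

KNOWN_TERMS = {"path", "pathre", "context", "symbol", "re", "text", "id"}

def parse_query(q):
    parsed = {}
    tokens = q.split(" ")
    for i, token in enumerate(tokens):
        term, sep, value = token.partition(":")
        if sep and term in KNOWN_TERMS:
            if term in ("re", "text"):
                # re: and text: consume the remainder of the query string;
                # text: is regexp-escaped so it matches literally.
                rest = " ".join([value] + tokens[i + 1:])
                parsed[term] = rest if term == "re" else re.escape(rest)
                return parsed
            parsed[term] = value
        else:
            # DEFAULT: everything from here on becomes a regexp-escaped
            # combined text/file/identifier search payload.
            parsed["default"] = re.escape(" ".join(tokens[i:]))
            return parsed
    return parsed

print(parse_query("context:4 path:dom/webidl Foo"))
# {'context': '4', 'path': 'dom/webidl', 'default': 'Foo'}
```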

What's Great About The Current Syntax

Although searchfox's syntax turns out to be more powerful than the documentation that claims "No search operators are supported.", arguably one of the strongest points of searchfox is that the claimed lack of syntax means there's no up-front decision making required. You type the thing you're looking for, you get results, and you perhaps iterate from there. Except when result limits are hit, you can usually have a high degree of confidence that the results you want are on the page if you didn't make any typos.

Anecdotally, this iteration can often take the form of using the ctrl-f "find in page" functionality rather than the built-in functionality, especially since path filtering on the "default" search will exclude semantic results. For example, if looking for a webidl interface, one might ctrl-f search for "webidl". And if looking for code that is known to exist in a general subsystem with a known-to-the-user path segment like "xpcom" or "mfbt", that's also easy to ctrl-f for, unlikely to result in false positives, and has much more predictable response times than sending the query back to the server (from a user perspective, and sometimes in practice).

Lessons from DXR

https://wiki.mozilla.org/DXR_Query_Language_Refresh describes the ~20 filters DXR supported.

From user feedback artifacts

We have a number of user feedback artifacts around discussions of DXR throughout its lifetime and some from the announcement of searchfox. Note that many pieces of DXR feedback were addressed, and are noted here as indications of the importance of those features.

Various lessons related to searches:

  • Latency/speed is incredibly important.
  • Having to use a special search filter to get the result you want and having the possibility of not being sure if there are 0 results or you used the wrong filter can be very frustrating.
    • A common complaint with DXR early on was that to find a file with a word in its name you had to manually use "path:".
    • Apparently this was also the case for needing to know if something was a macro or function or var, etc. It seems like this was later handled by id?
  • Prioritizing definitions is good.
  • Quoting can be confusing and dependent on muscle memory. (mxr had quotes be part of the search string, dxr seemed to have quotes be consumed by the query parser, searchfox definitely treats them as part of the search string.)
    • This can be even more confusing when multi-term support is present. An example was given of the tri-gram indexing era DXR where searching for one two would show files that included both the words one and two rather than searching for the phrase "one two". (This does seem like a situation where clearly explaining what the search actually did would help people learn quickly...)
    • Arguably if you support multiple terms that are subject to union/intersection then you inherently need to treat quotes as starting a phrase unless you go with a very unconventional approach like requiring required phrase with "quotes" in it and:"second required phrase but \"had to escape these quotes\""
  • Incremental results can make people feel not in control, especially if it's not clear when the results are "done".
  • Query syntax failure modes are important; ex: people trying to use regexp mode and forgetting the required trailing "/" broke things in a non-obvious way.
  • Requiring use of semantic predicates can lead to confusion when semantic analysis doesn't understand something. An example was given of trying to use DXR's "function:" filter for python code, but DXR didn't understand python.
  • Documentation and discovery mechanisms are important; people did not understand what all the semantic predicates could do.
  • People want extra lines of context, possibly not just a fixed number of lines, but just excerpting a statement in its entirety (like peekRange/peekLines).
  • People like magic like for IDL attributes where the "foo" attr is unified with GetFoo/SetFoo as appropriate.
  • People were really concerned about being able to switch between multiple trees!
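The quoting lesson above (quotes delimiting a phrase, with escaped quotes inside them remaining literal) maps onto conventional shell-style tokenization. As a purely illustrative sketch, and not the behavior of mxr, dxr, or searchfox, Python's shlex implements exactly those rules:

```python
# Illustrative only: shell-style (POSIX) tokenization where double quotes
# delimit phrases and backslash-escaped quotes stay literal in the token.
import shlex

def split_phrases(q):
    return shlex.split(q)

print(split_phrases('one "two three" four'))
# ['one', 'two three', 'four']
```

Under these rules, searching for one two would be two separate terms, while "one two" would be a single phrase, addressing the tri-gram-era DXR confusion described above.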

Proposal

Goals relating to the querying process: syntax, refinement, UX

  • Latency/speed continues to be the top priority and this should inform architectural decisions. In particular, results should not require any client-side JS logic to run and issue new dynamic requests. (It is, however, acceptable if the server response is structured to incorporate sub-resources to maximize caching while maintaining parallelism and minimizing overall UX latency, such as by initiating the cached generation of those resources locally on the server without waiting for the client to request them.)
  • Stability / finality of search results. It should be clear when we're done getting results from the server (particularly as a result of search-as-you-type, if still present) and the layout and results should be stable at that point so the user can start processing.
  • Favor interactions that start with a simple query and then can be interactively refined or traversed from. But ensure that the results of the refinement/traversal are encoded in a query that can then be shared and/or permuted. This would also enable experts and/or those with different accessibility/interaction patterns to construct exact queries.
    • For example, if the user is looking for the subclasses of interface "Foo" or overrides of method "Bar", encourage a flow where the user first searches for "Foo" or "Bar" and then can navigate from there, with the results potentially already including the results directly.
  • Support shared customization of queries, recognizing that mozilla-central is worked on by many teams who will each have their own areas of specific interest in the codebase and common searches that they are likely to want to be able to assign potentially scarce short identifiers to.
  • Support diagrams!

Lesser Goals: Support, but don't design around

  • multi-line search: find term A within N lines of term B. We should definitely be able to support this, but it's okay for the syntax for this to look weird.

Specific Proposal

  • We use a "term:value" syntax generally, where the terms can be drawn from presets. We'd probably use :kats' https://crates.io/crates/query-parser for this.
  • The presets are expressed in the URL route scheme /:tree/query/:preset
  • Presets can come from the tree being indexed (the longer term approach for self-maintenance, ex under .searchfox or otherwise identified by a "repo_files" style mechanism which could potentially allow for moz.build to identify them, etc.) or the mozsearch-mozilla config repo (the initial approach).
  • Presets map down to some combination of:
    • Other "term:value" expansions. For example, webidl: might expand to default:$0 path:dom/webidl,dom/chrome-webidl.
    • The searchfox-tool command-line-style pipeline introduced in bug 1707282. All term expansions must in fact end up mapped to this pipeline.
    • Data maps.
      • For example, a team might want to prioritize the results from the code they own. This might be accomplished by listing the bugzilla "product :: component" pairs the team owns, which can build on the moz.build mappings from every file. Or people might otherwise curate subsystem mappings that could help with the faceting of results. These could be provided to the results processing to help prioritize the results list and/or provide for easy faceting.
      • Documentation / example metadata. The UI would eventually support listing/searching known terms and it would be nice to be able to have some level of documentation and perhaps series of example usages of the command that the user can directly try out.
  • This syntax also would support pipes! It's likely the mapping process for terms would already know when a new pipeline step must be added, but it seems like the pipe could help with readability of pipelines, and in some cases might be necessary to eliminate ambiguity where the application of a term could make sense in the current pipeline segment or in its own (new) segment.
  • The underlying pipeline steps each have specific underlying possible types with inherent presentation options. Some of these would automatically be coerced by the addition of an additional pipeline step.
    • For example IdentifierList would always want to be passed to search-identifiers which produces a SymbolList and SymbolList would always want to be passed to crossref-lookup and then that's what the normal search results mechanism consumes.
    • One of the things I want to try with this new setup is using show-html to excerpt the fully rendered HTML output for showing the results. The pipeline command would gain a new mode to support this and we might end up with a new type of pipeline value to support it. This might also come after a new pipeline command like compile-results that would flatten results into an ordered list and do result count and work limiting. This could also inform show-html which perhaps wouldn't want to extract the HTML for all results, just the first 100 or something.
      • show-html is an example of a command that could parallelize itself, leveraging our use of tokio.
  • The presentation/compilation step would also have various heuristics that could run a number of additional pipeline steps in parallel, for example:
    • For top-ranked class results, traversing the subclass/superclass relationships (with safety limits) and diagramming them.
    • For top-ranked method results that are overridden or override something, traverse the relationships (with safety limits) and diagramming them, including relevant aggregate statistics like showing how many callers exist for each method and providing annotations that allow the client to respond to clicks to help facet the results list.
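The preset expansion described above (the webidl: example with its $0 placeholder) can be sketched as a simple template substitution. The PRESETS table, the function name, and the pass-through behavior are invented for illustration; the real mapping lives in searchfox's TOML config.

```python
# Hypothetical preset table sketching the "$0" value substitution from
# the webidl: example above; not searchfox's actual expansion code.
PRESETS = {
    "webidl": "default:$0 path:dom/webidl,dom/chrome-webidl",
}

def expand_preset(term, value):
    template = PRESETS.get(term)
    if template is None:
        # Not a preset: pass the term:value through unchanged so the
        # normal term mapping can handle it.
        return f"{term}:{value}"
    return template.replace("$0", value)

print(expand_preset("webidl", "NavigatorID"))
# default:NavigatorID path:dom/webidl,dom/chrome-webidl
```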

Easy Performance Improvement Options

The mozilla-central tree is without a doubt our most important tree. We can allocate resources specifically to maximize the performance of this tree and help ensure that other trees do not hinder its performance.

The mozilla-central crossref file is 2.0G. The crossref-extra file is 4.0G. We currently use t3.large instances for our web-servers, which have 8.0G of RAM and 2 vCPUs, but we're a bit wasteful about the number of old instances we keep around. t3 instances, at least, are priced so that running 2 instances of one tier costs about the same as running 1 of the next tier up. If we addressed the backup server inefficiencies, we could move to t3.xlarge with 4 vCPUs and 16.0G of RAM and help ensure that both crossref files are kept in RAM at all times. We could also potentially kick other indexes out of config1 so it's just mozilla-central, but the others in there right now aren't all that large (313M + 706M for comm-central, 25M + 29M for mozilla-mobile, 50M + 79M for nss).
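As a back-of-envelope check of the sizing argument above (file sizes are from this comment; RAM figures are the standard t3 specs):

```python
# Back-of-envelope arithmetic for keeping both mozilla-central crossref
# files resident in RAM, using the sizes quoted above.
crossref_gb = 2.0
crossref_extra_gb = 4.0
needed_gb = crossref_gb + crossref_extra_gb  # 6.0G just for crossref data

t3_large_ram_gb = 8.0    # 2 vCPUs
t3_xlarge_ram_gb = 16.0  # 4 vCPUs

# On a t3.large the crossref files alone would consume 75% of RAM; on a
# t3.xlarge they leave 10G of headroom for the OS, nginx cache, etc.
print(needed_gb / t3_large_ram_gb, t3_xlarge_ram_gb - needed_gb)
```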

We also currently use EBS for storage, and we could potentially consider instead using local SSD or paying for fancier EBS storage or something.

We also don't currently pre-heat the nginx caches when starting up the web-server by replaying queries that users previously have made, but we absolutely could do that.

Depends on: 1763005

Progress stack landed at https://github.com/mozsearch/mozsearch/pull/505

Brief overview:

An example test invocation is:

query --dump-pipeline foo.bar+hats()

with the resulting pipeline dump output (as of the current time; this will change to include a pipeline junction for a new compile-results stage, potentially an automatic expand-results stage added to the pipeline for the semantic-search group, plus more):

{
  "groups": {
    "file-search": {
      "segments": [
        {
          "command": "search-files",
          "args": [
            "foo.bar+hats()"
          ]
        }
      ]
    },
    "semantic-search": {
      "segments": [
        {
          "command": "search-identifiers",
          "args": [
            "foo.bar+hats()"
          ]
        }
      ]
    },
    "text-search": {
      "segments": [
        {
          "command": "search-text",
          "args": [
            "--re='foo\\.bar\\+hats\\(\\)'"
          ]
        }
      ]
    }
  }
}

Note: There may be some premature or over-escaping going on with the re value payload. The intent when passing things to structopt/clap is to avoid re-stringifying the separate arguments if possible, which means the shell quoting going on there may be wasted; also the key/value may end up split, etc. etc.
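The escaped --re payload in the dump above is just standard regexp metacharacter escaping. For illustration, Python's re.escape reproduces it; the actual escaping happens in the Rust pipeline code.

```python
import re

# '.', '+', '(' and ')' are regexp metacharacters, so the literal query
# string "foo.bar+hats()" must be escaped before use as a regexp.
escaped = re.escape("foo.bar+hats()")
print(escaped)  # foo\.bar\+hats\(\)
```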

Example final query pipeline graph for the query "context:4 'DoubleBase::doublePure'" (C:4 is also equivalent to context:4).

{
  "groups": {
    "display": {
      "input": "compiled",
      "segments": [
        {
          "command": "augment-results",
          "args": [
            "--after=4",
            "--before=4"
          ]
        }
      ],
      "output": "result",
      "depth": 0
    },
    "file-search": {
      "input": null,
      "segments": [
        {
          "command": "search-files",
          "args": [
            "DoubleBase::doublePure"
          ]
        }
      ],
      "output": "file-search",
      "depth": 0
    },
    "semantic-search": {
      "input": null,
      "segments": [
        {
          "command": "search-identifiers",
          "args": [
            "DoubleBase::doublePure"
          ]
        },
        {
          "command": "crossref-lookup",
          "args": []
        },
        {
          "command": "crossref-expand",
          "args": []
        }
      ],
      "output": "semantic-search",
      "depth": 0
    },
    "text-search": {
      "input": null,
      "segments": [
        {
          "command": "search-text",
          "args": [
            "--re=DoubleBase::doublePure"
          ]
        }
      ],
      "output": "text-search",
      "depth": 0
    }
  },
  "junctions": {
    "compile": {
      "inputs": [
        "file-search",
        "semantic-search",
        "text-search"
      ],
      "command": {
        "command": "compile-results",
        "args": []
      },
      "output": "compiled",
      "depth": 0
    }
  },
  "phases": [
    {
      "groups": [
        [
          "file-search"
        ],
        [
          "semantic-search"
        ],
        [
          "text-search"
        ]
      ],
      "junctions": [
        "compile"
      ]
    },
    {
      "groups": [
        [
          "display"
        ]
      ],
      "junctions": []
    }
  ]
}
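To make the graph structure above concrete, here is a sketch in Python of how such a phases/groups/junctions dump could be driven: each phase's groups are independent of one another (the real server can run them as parallel tokio tasks), and junctions then fan-in the groups' named outputs. The run_pipeline/stub names and the toy graph are invented for illustration; command execution is stubbed rather than invoking real searchfox commands.

```python
# Illustrative executor for a phases/groups/junctions pipeline dump like
# the one above; NOT searchfox's actual execution engine.
def run_pipeline(graph, run_command):
    values = {}  # named intermediate values, keyed by "output" names
    for phase in graph["phases"]:
        # Groups within a phase are independent of each other; in these
        # dumps each inner list holds a single group name.
        for (name,) in phase["groups"]:
            group = graph["groups"][name]
            value = values.get(group["input"])  # None for source groups
            for seg in group["segments"]:
                value = run_command(seg["command"], seg["args"], value)
            values[group["output"]] = value
        # Junctions then join the named outputs produced so far.
        for jname in phase["junctions"]:
            junction = graph["junctions"][jname]
            inputs = [values[i] for i in junction["inputs"]]
            cmd = junction["command"]
            values[junction["output"]] = run_command(
                cmd["command"], cmd["args"], inputs)
    return values

# Toy graph shaped like the dump: two source groups joined by a junction
# in phase 1, then a display group consuming the joined value in phase 2.
def stub(command, args, value):
    return (command, value)

toy = {
    "groups": {
        "a": {"input": None, "segments": [{"command": "up", "args": []}],
              "output": "a-out"},
        "b": {"input": None, "segments": [{"command": "up", "args": []}],
              "output": "b-out"},
        "display": {"input": "joined",
                    "segments": [{"command": "show", "args": []}],
                    "output": "result"},
    },
    "junctions": {
        "join": {"inputs": ["a-out", "b-out"],
                 "command": {"command": "join", "args": []},
                 "output": "joined"},
    },
    "phases": [
        {"groups": [["a"], ["b"]], "junctions": ["join"]},
        {"groups": [["display"]], "junctions": []},
    ],
}

values = run_pipeline(toy, stub)
print(values["result"])
```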

https://github.com/mozsearch/mozsearch/blob/master/tools/src/query/query_core.toml is the current TOML config that provides the query mapping.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Blocks: 1799796
Blocks: 1799802