Bug 956849 (closed): opened 10 years ago, closed 10 years ago

Stop doing OR queries for multi-word searches by default

Categories

(Webtools Graveyard :: DXR, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: erik)

Details

(Whiteboard: mxr-parity)

This is apparently driving more people than just me mad.  When we search for |interface EventTarget| for example, we should immediately get to the results as we would in MXR, without having to wrap the query in quotes.
Erik has been working on a new query parser, could this be incorporated into that work? If not, does it have to wait until after the new parser lands?

(also +1 for this, I would prefer spaces to be taken literally rather than meaning OR)
Flags: needinfo?(erik)
Have you been reading our roadmap again, Ehsan? ;-) Your request is actually a goal for this quarter.

The new query parser (landed now on the ui branch) was step one, fixing a lot of surprising, hard-to-define behavior around quotation marks, negation, and filter arguments, bringing DXR's internal understanding of what you type to a sane state that we can work from. For instance, it lets you express literal quotation marks, you'll be able to use quoted strings as arguments to filters, and the special quoting rules around regexes go away.
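As a rough illustration of the quoting behavior described above (this is a toy, not the actual parser on the ui branch, and the `type:` filter name is just an example), a tokenizer that keeps double-quoted phrases together and lets `\"` express a literal quotation mark could look like:

```python
import re

# Toy tokenizer, illustration only: a double-quoted phrase becomes one
# token, and \" inside quotes yields a literal quotation mark.
TOKEN = re.compile(r'"((?:\\.|[^"\\])*)"|(\S+)')

def tokenize(query):
    tokens = []
    for quoted, bare in TOKEN.findall(query):
        # Exactly one of the two groups matches per token.
        tokens.append(bare if bare else quoted.replace('\\"', '"'))
    return tokens

print(tokenize('type:nsIFoo "interface EventTarget"'))
# → ['type:nsIFoo', 'interface EventTarget']
```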

Step two is to take that sane internal state and do more intelligent things with it. The first of these is to dispense with the weird file-at-a-time-ANDed, line-at-a-time-ORed search logic we currently have and move to grep-style, line-by-line searching, ANDing query terms together as you suggest. That would probably suffice for your specific example, but I also plan to improve ranking after that, so that a line containing "interface EventTarget" will rank higher than "EventTarget interface". (That will require rethinking our visual presentation and possibly our storage backend, so we'll see how worthwhile it is after you and others have test-driven the simply ANDed search for a while.)
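The proposed grep-style semantics boil down to something very small (illustrative sketch only, not DXR's implementation): a line matches when it contains every query term, and results stay in line order.

```python
# Sketch of grep-style ANDed search: a line matches when it contains
# every query term; results come back in line order.

def and_search(lines, terms):
    """Yield (line_number, text) for lines containing all terms."""
    for number, text in enumerate(lines, start=1):
        if all(term in text for term in terms):
            yield number, text

source = [
    "class EventTargetHelper {",
    "interface EventTarget : nsISupports",
    "  // implements the EventTarget interface",
]
matches = list(and_search(source, ["interface", "EventTarget"]))
print(matches)
# Both the declaration and the comment line match; line 1 lacks "interface".
```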

You can see some other use cases on our roadmap under https://wiki.mozilla.org/DXR_UI_Refresh#Search. If you have any more, please do bring them up. Again, thanks for the feedback!
Flags: needinfo?(erik)
Nick: the spaces-taken-literally thing could go two ways. At the Summit, I and a few other people in the room seemed pretty happy with a words-ANDed-together semantic. As I said in the above comment, I'd like to add some smart ranking to improve upon that. However, if a clear majority of uses would benefit from literal spaces, I'd be open to redesigning the just-redesigned (argh!) query language around that. On a meta note, we really need to get some search query logging happening so we can make data-based decisions about things like this. If you have a pile of use cases to dump, please do make a heading on https://wiki.mozilla.org/DXR_Query_Language_Refresh and add some pros and cons.
(In reply to Erik Rose [:erik][:erikrose] from comment #2)
> you suggest. That would probably suffice for your specific example, but I
> also plan to improve ranking after that, so that a line containing
> "interface EventTarget" will rank higher than "EventTarget interface". (That
> will require rethinking our visual presentation and possibly our storage
> backend, so we'll see how worthwhile it is after you and others have
> test-driven the simply ANDed search for awhile.)

At the risk of going off on a tangent, I would strongly prefer searches for "EventTarget interface" to return only results for that exact string, not a fuzzy search (or perhaps fuzziness could be an option, like case sensitivity). Even further on a tangent, I don't like the idea of search ranking: I strongly want a predictable order for my search results. My use case is usually that I'm searching for something specific (some identifier in some file, say) but can't remember exactly what, so being able to jump quickly to a place in the results, rather than scanning them all, is important (hence why the bug I fixed to put files in alphabetical order was so annoying to me).
(In reply to Erik Rose [:erik][:erikrose] from comment #3)
> Nick: the spaces-taken-literally thing could go two ways. At the Summit, I
> and a few other people in the room seemed pretty happy with a
> words-ANDed-together semantic. As I said in the above comment, I'd like to
> add some smart ranking to improve upon that. However, if a clear majority of
> uses would benefit from literal spaces, I'd be open to redesigning the
> just-redesigned (argh!) query language around that. On a meta note, we
> really need to get some search query logging happening so we can make
> data-based decisions about things like this. If you have a pile of use cases
> to dump, please do make a heading on
> https://wiki.mozilla.org/DXR_Query_Language_Refresh and add some pros and
> cons.

I think ANDing ought to be as good as or better than literal spacing, but I need to have a play to be sure. I do think order is important, though; I don't think I would ever search for terms out of order. That's just not how I think about code.

I guess the distinction is between searching for a phrase vs ANDing. To me the former is more intuitive, but it might just be me.
I agree with Nick.  I think we should opt for the least surprising results here, which means interpreting the query literally unless there are special tokens in it, and not ranking the results.
Don't panic. :-) "Ranking" has a different meaning for DXR than it does for Google. We will not scramble our search results; our users have made it clear that it's useful to have them in line order, grouped by file. What I'm talking about is more akin to how direct results currently work: some "darn likely" matches are "called out" in some way, maybe by just redirecting there, maybe by making them red and blinking and beneath a 3-D spinning skull, or maybe simply by being repeated in a colored box above the rest of the results.
(In reply to comment #7)
> Don't panic. :-) "Ranking" has a different meaning for DXR than it does for
> Google. We will not scramble our search results; our users have made it clear
> that it's useful to have them in line order, grouped by file. What I'm talking
> about is more akin to how direct results currently work: some "darn likely"
> matches are "called out" in some way, maybe by just redirecting there, maybe by
> making them red and blinking and beneath a 3-D spinning skull, or maybe simply
> by being repeated in a colored box above the rest of the results.

Oh that seems fine!  (except for the blinking and skull part, which we should use to render all search results with, not just the darn likely ones ;)
Whiteboard: [mxr-parity] → mxr-parity
Assignee: nobody → erik
The good news: line-by-line searching is implemented at https://github.com/erikrose/dxr/tree/line-by-line-search and passes all tests!

The bad news: it performs so badly that I don't want to deploy it. Feel free to argue if the timings below look acceptable to you, but I think it would be a net worse experience.

I just got through with a round of benchmarking, and SQLite--without a completely redesigned schema, at least--is not going to hold up. Moving from file to line granularity is an expansion from 46000 rows to 15M, and no matter how I massage the query structure to tip off the planner, it comes down to an awful lot of loops.

I've tried a number of alternatives.

Initially, I tried cramming everything--extent fetching and all--into one query with clever left joins. That's what's at the top of the line-by-line-search branch now. The planner planned that pretty naively, and it performed very badly on a production-sized codebase.

I next tried some IN queries with unions and intersects and excepts. That was the best performer, coming back under a second for type queries but still lagging to 16s for simple type-ref ones (which involve an additional join) that find 678 occurrences. That timing does not include fetching extents, which would have to be an additional query for each result under this approach:

    -- What if a subquery returns a lot of lines? 16s for string (678 rows in union), 0.5s for int (9 rows).
    select files.path, files.icon, files.encoding, files.id as file_id, lines.id as line_id, lines.number, trg_index.text, extents(trg_index.contents)
    -- (a count(*) select list was substituted for the columns above when timing)
    from files
    inner join lines on files.id=lines.file_id
    inner join trg_index on trg_index.id=lines.id
    where lines.id in
    (
        select * from (
          select lines.id
           from type_refs as t0
           inner join types on t0.refid=types.id
           inner join lines on lines.file_id=t0.file_id and lines.number=t0.file_line
           where types.name like 'string' escape '\'

           union

          select lines.id
           from typedef_refs as t0
           inner join typedefs on t0.refid=typedefs.id
           inner join lines on lines.file_id=t0.file_id and lines.number=t0.file_line
           where typedefs.name like 'string' escape '\'
        )
    )
    order by files.path, lines.number
    limit 5;

The original query structure was right out, taking 36s for even a simple type query:

    select files.path, files.icon, files.encoding, files.id as file_id, lines.id as line_id, lines.number, trg_index.text, extents(trg_index.contents)
    from files, lines, trg_index
    WHERE files.id=lines.file_id
    AND trg_index.id=lines.id
    AND (EXISTS (
        select 1 from types as t
        where t.name like 'string' escape '\'
        and lines.file_id=t.file_id
        and lines.number=t.file_line
    )
    OR EXISTS (
        select 1 from typedefs as t
        where t.name like 'string' escape '\'
        and lines.file_id=t.file_id
        and lines.number=t.file_line
    ))
    order by files.path, lines.number
    limit 5;

We got away with that before only because 46000 (files) is small enough to loop around.

So I'm going to move straight on to ES-based searching. With denormalization, vectorized set bitmap operations, and parallelization (in order of decreasing importance), I have a reasonable expectation it will perform well. I will do benchmarking before writing any production code so we can change tactics if this, too, is slow.

Hang in there; I'm aiming to have this done this quarter. And now I'm all practiced up at rewriting the filter machinery. ;-)
For posterity, I also added these indices before the above performance tests:

create index lines_number on lines (number);
create index type_name_index on types (name);
create index typedefs_name_index on typedefs (name);
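For anyone who wants to poke at this locally, here's a minimal reproduction using Python's built-in sqlite3 module (the table shapes are simplified stand-ins, not DXR's real schema). An equality lookup is shown because SQLite's default case-insensitive LIKE won't use a plain text index; the exact plan text varies by SQLite version, but the index name shows up in it.

```python
# Hypothetical, simplified reproduction of the indexing step with the
# stdlib sqlite3 module; table shapes are stand-ins for DXR's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lines (id integer primary key, file_id integer, number integer);
    create table types (id integer primary key, name text, file_id integer, file_line integer);
    create index lines_number on lines (number);
    create index type_name_index on types (name);
""")

# An equality predicate can use the name index.
plan = conn.execute(
    "explain query plan select * from types where name = 'string'"
).fetchall()
print(plan)
```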
Btw, the benchmarking was amazing. We can do even naive regex searches (with leading .*!) in 400ms and trigram accelerated ones in 100ms. Searches for simple key/value matches, like we'll use for simple text containment and structural queries, are on the order of 10ms. (And actually, that's for phrase matching. The dead-simple term matching used for structural queries should be even faster, but network latency starts to dominate at that point.) And that's all on my laptop.
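For the curious, trigram acceleration amounts to indexing every three-character window of each line and intersecting posting lists as a cheap pre-filter before the real match. A toy Python version (purely illustrative; the real thing is what's on the ES branch) looks like:

```python
from collections import defaultdict

# Toy trigram index, illustration only: every 3-char window of a line is
# indexed; a literal query is first narrowed to lines containing all of
# its trigrams, then each candidate is verified with a real substring test.

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

corpus = {
    1: "interface EventTarget : nsISupports",
    2: "void AddEventListener(const nsAString& aType)",
    3: "the EventTarget interface, described elsewhere",
}

index = defaultdict(set)
for line_id, text in corpus.items():
    for gram in trigrams(text):
        index[gram].add(line_id)

def search(query):
    grams = trigrams(query)
    candidates = (set.intersection(*(index.get(g, set()) for g in grams))
                  if grams else set(corpus))
    # The trigram filter yields a superset; verify the actual match.
    return sorted(i for i in candidates if query in corpus[i])

print(search("EventTarget"))
```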
This is implemented on the "es" branch, which should be staged in a couple of weeks.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard