Status

()

defect
P3
normal
2 years ago
5 months ago

People

(Reporter: mak, Unassigned)

Tracking

(Depends on 2 bugs, Blocks 2 bugs)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox54 affected)

Details

(Whiteboard: [fxsearch])

The current plan is to index urls and titles ONLY, no page contents.
The goal is to make the awesomebar able to match in a much faster way.

Some bullet points, a more detailed plan will come later:
1. The index should be in a separate db, that we can rebuild at will.
2. We need analysis to identify the text locale, for this we could try to use the Compact Language Detector library (https://github.com/google/cld3). It won't be perfect for short strings, for which we may need some guessing.
3. We need tokenization, is there any multi-locale tokenizer in icu? we may have to build our own, similar to Thunderbird's one. We want good tokenizers for most common western locales and bi-gram for CJK. What are the most common locales on the Web?
4. We need folding and maybe stemming. Stemming is hard, and we may have no dictionaries. Basic case folding is important, ICU should be able to do it.

Some interesting documentation:
Notes from Drew about prior investigation of FTS - https://wiki.mozilla.org/User:Adw/FTS
Very interesting document about multi-locale in ElasticSearch - https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html
Boundary analysis in ICU - http://userguide.icu-project.org/boundaryanalysis
Third party FTS5 notes - https://github.com/groue/GRDB.swift/blob/master/Documentation/FTS5Tokenizers.md
Depends on: 1425333
Blocks: PlacesIO
It's not clear to me why a locale-specific tokenizer is considered necessary when currently AUTOCOMPLETE_MATCH is not locale aware. It splits words on ASCII spaces (using nsCWhitespaceTokenizer), and doesn't perform proper case folding (it compares each codepoint, lowercasing them individually, which is wrong for some cases).

Given that, the stock unicode61 tokenizer seems like it would be more or less the same as what we do now (maybe even slightly better? I'm not how out of date case folding as defined by Unicode version 6.1 is), but likely much faster.

(To be clear: I do see why locale-aware tokenization is desirable in the long run, but this bug makes it seem like not having it is a deal breaker)
The current system is sub-par to what we actually need, as previously said even the Thunderbird tokenizer (bi-gram with some special CJK char handling) would work properly, as far as we're ok retaining the current matching.
Anyway, you are right that, by reducing our requirements to "what we do right now", we could do this far more easily.

Though, taking a solution without comparing against alternatives wouldn't be great, and that's likely where most of the work would go: comparing unicode61, modified-tri-gram, modified-thunderbird-cjk (Where modified indicates it should be able to recognize urls and tokenize them differently) against the most common locales we serve.
And the tokenizer is only part of the problem, we'd still miss a lot on stemming and folding, but we could do with tokenization as a first step.
Right, but we could do 'what we do now' and then move forward with implementing an improved tokenizer, no? Or is the issue one of compatibility (we don't want to rebuild the FTS index in schema upgrades, I guess?)
The idea would be that when the index is not available we use the existing matching, in the meanwhile we can rebuild the index in background, so it wouldn't be a big deal, I guess.
Oh, yes, but that's not what I meant.

I meant ask if there's a reason we couldn't implement FTS using unicode61 or the Thunderbird tokenizer in the short term, giving us something roughly equivalent to what we have now, and then work towards improving the tokenizer as time is available.

The only real downside to this that I see is that we'd have to rebuild the FTS index when changes are made to the tokenizer, but maybe there are others?
we need something on-par to what we have, that requires tests demonstrating that. For example I think we currently match on specific boundaries and unicode61 may act differently. Is that better or worse? We don't know without writing matching tests.
Also it has a non trivial cost: regardless of the tokenization we need the whole architecture to build the index in background, attach it, replace it on-the-fly. It's not impossible or particularly hard, but it needs a dedicated engineer for a few time.
Right, sure. My questions are partially/mostly directed around understanding why we aren't doing this in places, and whether or not those reasons make sense as reasons to continue not doing it in the rust places in https://github.com/mozilla/application-services (It sounds like they probably don't)

Ah, so I looked into doing this in mozilla/application-services (e.g. rust-places), and there's a larger issue than tokenizers: FTS queries don't support non-prefix matches.

That is to say, it only supports MATCH_BEGINNING, e.g. if you're searching for 'ample', there's no way to use the FTS index such that it would find 'example'. (exam would find it though, since it has good support for prefix matches, although they can have some perf issues unless you add a prefix index).

Because of this, it seems like you would need to always have the current logic present as a fallback. It's possible this would still be good enough, if we only use the fallback when the FTS query doesn't return enough results. Still, this would be a degradation in terms of search quality.

It's worth noting that we'd need a custom tokenizer still to implement our notion of 'boundary' also.

(In reply to Thom Chiovoloni [:tcsc] from comment #9)

That is to say, it only supports MATCH_BEGINNING, e.g. if you're searching for 'ample', there's no way to use the FTS index such that it would find 'example'. (exam would find it though, since it has good support for prefix matches, although they can have some perf issues unless you add a prefix index).

Because of this, it seems like you would need to always have the current logic present as a fallback. It's possible this would still be good enough, if we only use the fallback when the FTS query doesn't return enough results. Still, this would be a degradation in terms of search quality.

It's worth noting that we'd need a custom tokenizer still to implement our notion of 'boundary' also.

Yeah, I'm not sure in general the value of matching anywhere is so critical... For our use cases matching on boundaries may be enough and it's the default behavior. Surely to have proper boundaries we need a proper tokenizer, for urls for example.

As more Thunderbird use-case context, Thunderbird's global search autocomplete logic used to build[1] suffix trees[2] built against a preloaded set of the user's more popular contacts and tags/labels so that you could type "drew" and match "Andrew" even though the FTS index would never find that.

While it would be possible to build a suffix-tree just from the most "frecent" data, it should also be possible to use the fts5vocab virtual table[3] to periodically extract a "likely terms" list from the database and build a suffix-tree over those terms in order to use that as a basis for FTS queries for those discovered terms.

I should also note that Thunderbird's plan going forward had been to engage in proactive database sharding, partitioning databases based on both time and frecency-esque factors which were already used for ordering results lists. That way we would query a smaller set of data that was much more likely to contain what we'd display at the top of the results list, and then only continue sequentially opening and searching the other databases until we obtained enough results or reached an arbitrary threshold before showing a "search more" affordance.

1: It looks like autocomplete is broken on Ubuntu's current Thunderbird build, presumably due to platform changes.
2: https://en.wikipedia.org/wiki/Suffix_tree with my probably not amazing implementation at https://searchfox.org/comm-central/source/mailnews/db/gloda/modules/suffixtree.js
3: https://www.sqlite.org/fts5.html#the_fts5vocab_virtual_table_module

You need to log in before you can comment on or make changes to this bug.