The current plan is to index urls and titles ONLY, no page contents. The goal is to make the awesomebar able to match in a much faster way. Some bullet points; a more detailed plan will come later:

1. The index should be in a separate db that we can rebuild at will.
2. We need analysis to identify the text locale; for this we could try the Compact Language Detector library (https://github.com/google/cld3). It won't be perfect for short strings, for which we may need some guessing.
3. We need tokenization. Is there any multi-locale tokenizer in ICU? We may have to build our own, similar to Thunderbird's. We want good tokenizers for the most common western locales and bi-grams for CJK. What are the most common locales on the Web?
4. We need folding and maybe stemming. Stemming is hard, and we may have no dictionaries. Basic case folding is important; ICU should be able to do it.

Some interesting documentation:
- Notes from Drew about a prior investigation of FTS - https://wiki.mozilla.org/User:Adw/FTS
- Very interesting document about multi-locale in Elasticsearch - https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html
- Boundary analysis in ICU - http://userguide.icu-project.org/boundaryanalysis
- Third-party FTS5 tokenizer notes - https://github.com/groue/GRDB.swift/blob/master/Documentation/FTS5Tokenizers.md
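To make point 1 concrete, here is a minimal sketch of a separate, rebuildable FTS index attached to the main database, using Python's sqlite3 for illustration. All table and column names here are hypothetical (loosely modeled on `moz_places`), and FTS5 availability depends on how the SQLite library was compiled:

```python
import sqlite3

# Hypothetical main places database (names are illustrative only).
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
main.executemany("INSERT INTO moz_places (url, title) VALUES (?, ?)",
                 [("https://example.org/", "Example Domain"),
                  ("https://bugzilla.mozilla.org/", "Bugzilla Main Page")])

# The FTS index lives in its own database (a file in practice,
# in-memory here), ATTACHed to the main connection.
main.execute("ATTACH DATABASE ':memory:' AS fts")
main.execute("CREATE VIRTUAL TABLE fts.places_index USING fts5(url, title)")

# Rebuilding at will is just: drop the attached file and repopulate it
# from moz_places, keying the FTS rowid to the place id.
main.execute("INSERT INTO fts.places_index (rowid, url, title) "
             "SELECT id, url, title FROM moz_places")

rows = main.execute(
    "SELECT rowid FROM fts.places_index WHERE places_index MATCH 'bugzilla'"
).fetchall()
print(rows)
```

Keeping the index in an attached database means a corrupt or stale index can simply be deleted and rebuilt without touching the main places data.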
It's not clear to me why a locale-specific tokenizer is considered necessary when the current AUTOCOMPLETE_MATCH is not locale aware. It splits words on ASCII spaces (using nsCWhitespaceTokenizer) and doesn't perform proper case folding (it compares codepoints one at a time, lowercasing them individually, which is wrong in some cases). Given that, the stock unicode61 tokenizer seems like it would be more or less the same as what we do now (maybe even slightly better? I'm not sure how out of date case folding as defined by Unicode 6.1 is), but likely much faster. (To be clear: I do see why locale-aware tokenization is desirable in the long run, but this bug makes it seem like not having it is a deal breaker.)
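A quick illustration of what stock unicode61 buys over per-codepoint ASCII lowercasing: Unicode-aware case folding plus (by default) diacritic removal. The data here is made up, and again this assumes an SQLite build with FTS5:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# unicode61 is the default FTS5 tokenizer; spelled out for clarity.
db.execute("CREATE VIRTUAL TABLE titles USING fts5(title, tokenize='unicode61')")
db.executemany("INSERT INTO titles (title) VALUES (?)",
               [("Café Müller",), ("ISTANBUL Daily",)])

# 'cafe' matches 'Café': unicode61 strips diacritics by default.
hit1 = db.execute("SELECT title FROM titles WHERE titles MATCH 'cafe'").fetchone()
# 'istanbul' matches 'ISTANBUL' via case folding.
hit2 = db.execute("SELECT title FROM titles WHERE titles MATCH 'istanbul'").fetchone()
```

Whether this folding behavior matches or improves on the current AUTOCOMPLETE_MATCH boundaries is exactly the kind of thing the matching tests discussed below would need to pin down.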
The current system is sub-par compared to what we actually need; as previously said, even the Thunderbird tokenizer (bi-gram with some special CJK char handling) would work properly, as long as we're ok with retaining the current matching. Anyway, you are right that by reducing our requirements to "what we do right now" we could do this far more easily. Though picking a solution without comparing it against alternatives wouldn't be great, and that's likely where most of the work would go: comparing unicode61, modified-tri-gram, and modified-thunderbird-cjk (where "modified" indicates it should be able to recognize urls and tokenize them differently) against the most common locales we serve. And the tokenizer is only part of the problem; we'd still miss a lot on stemming and folding, but we could do with tokenization as a first step.
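For readers unfamiliar with the Thunderbird approach, here is a rough sketch of the idea: ordinary case-folded word tokens for most scripts, overlapping bi-grams for CJK runs. This is an illustration of the technique, not the actual Thunderbird algorithm, and it deliberately omits the "modified" url-aware handling mentioned above:

```python
import unicodedata

def is_cjk(ch):
    """Crude script check via Unicode character names; illustrative only."""
    name = unicodedata.name(ch, "")
    return any(s in name for s in ("CJK UNIFIED", "HIRAGANA", "KATAKANA", "HANGUL"))

def tokenize(text):
    """Word tokens for alphanumeric runs, overlapping bi-grams for CJK runs."""
    tokens, word, cjk = [], [], []

    def flush_word():
        if word:
            tokens.append("".join(word).casefold())
            word.clear()

    def flush_cjk():
        if len(cjk) == 1:
            tokens.append(cjk[0])        # lone CJK char: emit as-is
        else:
            tokens.extend(a + b for a, b in zip(cjk, cjk[1:]))  # bi-grams
        cjk.clear()

    for ch in text:
        if is_cjk(ch):
            flush_word()
            cjk.append(ch)
        elif ch.isalnum():
            flush_cjk()
            word.append(ch)
        else:                            # separator: close both runs
            flush_word()
            flush_cjk()
    flush_word()
    flush_cjk()
    return tokens
```

For example, `tokenize("Mozilla 日本語")` yields a word token plus two overlapping bi-grams. Bi-grams sidestep the lack of word boundaries in CJK text at the cost of a larger index, which is one axis the comparison above would need to measure.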
Right, but we could do "what we do now" and then move forward with implementing an improved tokenizer, no? Or is the issue one of compatibility (we don't want to rebuild the FTS index in schema upgrades, I guess)?
The idea would be that when the index is not available we use the existing matching; in the meanwhile we can rebuild the index in the background, so it wouldn't be a big deal, I guess.
Oh, yes, but that's not what I meant. I meant to ask whether there's a reason we couldn't implement FTS using unicode61 or the Thunderbird tokenizer in the short term, giving us something roughly equivalent to what we have now, and then work towards improving the tokenizer as time is available. The only real downside I see is that we'd have to rebuild the FTS index when changes are made to the tokenizer, but maybe there are others?
We need something on par with what we have, and that requires tests demonstrating it. For example, I think we currently match on specific boundaries, and unicode61 may act differently. Is that better or worse? We don't know without writing matching tests. It also has a non-trivial cost: regardless of the tokenization, we need the whole architecture to build the index in the background, attach it, and replace it on-the-fly. It's not impossible or particularly hard, but it needs a dedicated engineer for some time.
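The build-in-background-and-swap part could look roughly like this sketch: write the new index to a scratch file, detach the stale index, rename the file into place, and re-attach. File and table names are hypothetical, and a real implementation would do the build off the main thread and handle readers mid-swap:

```python
import os
import sqlite3
import tempfile

def rebuild_index(main, index_path, rows):
    """Build a fresh FTS index in a temp file, then swap it in.
    `rows` is an iterable of (rowid, url, title) tuples; illustrative only."""
    tmp_path = index_path + ".tmp"
    tmp = sqlite3.connect(tmp_path)
    tmp.execute("CREATE VIRTUAL TABLE places_index USING fts5(url, title)")
    tmp.executemany(
        "INSERT INTO places_index (rowid, url, title) VALUES (?, ?, ?)", rows)
    tmp.commit()
    tmp.close()

    # Swap on-the-fly: detach the stale index (if any), atomically
    # replace the file, then re-attach the fresh one.
    try:
        main.execute("DETACH DATABASE fts")
    except sqlite3.OperationalError:
        pass  # no index attached yet; callers fall back to existing matching
    os.replace(tmp_path, index_path)
    main.execute("ATTACH DATABASE ? AS fts", (index_path,))

main = sqlite3.connect(":memory:")
path = os.path.join(tempfile.mkdtemp(), "index.sqlite")
rebuild_index(main, path, [(1, "https://example.org/", "Example")])
result = main.execute(
    "SELECT rowid FROM fts.places_index WHERE places_index MATCH 'example'"
).fetchall()
```

The fallback branch above is where "when the index is not available we use the existing matching" would hook in.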
Right, sure. My questions are partially/mostly directed at understanding why we aren't doing this in Places, and whether or not those reasons make sense as reasons to continue not doing it in the Rust places in https://github.com/mozilla/application-services (it sounds like they probably don't).