Open Bug 1340155 Opened 8 years ago Updated 24 days ago

Allow to load the ICU extension

Categories

(Core :: SQLite and Embedded Database Bindings, defect, P3)

defect

Tracking

()

Tracking Status
firefox54 --- affected

People

(Reporter: mak, Unassigned)

References

(Blocks 1 open bug, )

Details

Since now we ship ICU by default, we could start making use of it.
Possible uses:
1. sorting with a specific collation
2. FTS tokenizer
3. FTS folding
4. lower/upper with a specific collation

Since we support system sqlite, we'll have to ship our own icu extension library, and we should use the Runtime loadable extensions feature.

we clearly need to add the extension code to our repo, it will be updated at the same time as Sqlite.

We must set SQLITE_DBCONFIG_ENABLE_LOAD_EXTENSION with care to only enable the C-API. The SQL function should stay disabled, for security and stability reasons.

We should expose a specific enableICU(bool) from mozIStorageConnection. We don't want to expose a general method to load any extension, I think, we only want a subset of extensions we keep updated. I also don't think we want to enable it for any new connection, at least at the beginning.

If we don't want an additional dll around, we can use -DSQLITE_CORE to statically build the extension along with Storage and ICU into libxul (or into nss3 where sqlite would be if it's not system sqlite? does one of the 2 make more sense?).

We may want to disallow icu_load_collation, per the security concerns expressed in https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt at least until we figure whether we need it.
(In reply to Marco Bonardo [::mak] from comment #0)
> We must set SQLITE_DBCONFIG_ENABLE_LOAD_EXTENSION with care to only enable
> the C-API. The SQL function should stay disabled, for security and stability
> reasons.
> 
> We should expose a specific enableICU(bool) from mozIStorageConnection. We
> don't want to expose a general method to load any extension, I think, we
> only want a subset of extensions we keep updated. I also don't think we want
> to enable it for any new connection, at least at the beginning.

Agreed.  Related: Bug 1270882 to expose a mozIStorageConnection method for registering custom FTS3 tokenizers.

Also interesting is whether Thunderbird's custom tokenizer could be discarded in favor of an ICU-based one in core.  Thunderbird's custom tokenizer currently has the (2) FTS tokenizer and (3) FTS folding boxes ticked, but also has cleverness related to emitting CJK characters as bi-grams.  While we shouldn't optimize for Thunderbird's needs, Thunderbird's needs were generic mixed multi-lingual full-text search, and I would expect things that want FTS to actually want mixed multi-lingual FTS.  (Like Activity Streams?)  It's possible ICU may be able to do better than bi-gram indexing if it has the tables (and we have compiled them in, unlikely) to word-break based on exact or statistical understanding of where word boundaries are or are likely to be.

I'm needinfo-ing :m_kato, the author of Thunderbird's fancy tokenizer, for his thoughts.
 
> If we don't want an additional dll around, we can use -DSQLITE_CORE to
> statically build the extension along with Storage and ICU into libxul (or
> into nss3 where sqlite would be if it's not system sqlite? does one of the 2
> make more sense?).

I would be personally biased towards not adding an additional dll and keeping the extension with the SQLite code (so, in nss3) because:
- There tends to be build/installation hassle related to additional dll's.
- A separate dll seems like a tempting target for anti-virus libraries to replace wholesale in an attempt to provide some notional concept of "security", but do so in such a way that makes Firefox crash-prone. (I know we have mechanisms to avoid external DLL injection, but I'm not sure about whether that addresses wholesale replacement.)

> We may want to disallow icu_load_collation, per the security concerns
> expressed in
> https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt at
> least until we figure whether we need it.

This seems wise.  In general, I think collation consistency is potentially a foot-gun and it may be appropriate for mozStorage to provide some mechanism for maintaining the appropriate invariants and requiring that consumers use the mozStorage solution, enhancing it to meet their needs as appropriate.  For example, building on the "BackgroundSync API" changes that refactor schema creation/upgrades into mozStorage from dom/cache as part of bug 1217544.

Collation-wise, note that Gecko's IndexedDB has (Gecko-specific, non-standard) prior art on this in bug 871846 which added locale-aware indices (see bug 871846 comment 102 onwards for doc links).  It's not directly comparable because IDB uses its own custom key-encoding mechanism, using the ICU collation API directly to create an encoded version of the string as part of the binary key.  (See http://searchfox.org/mozilla-central/search?q=EncodeLocaleString).  IDB is then able to regenerate the index (without depending on the prior locale ordering) if the browser's configured locale changes ("auto" is the browser's locale, which can change), or if the ICU library is upgraded and locale collation mappings might have been modified.  (Although I don't think IDB does the latter.  I don't see a static_assert on the ICU version changing like it does for the structured clone version changing.  But it's also not a big deal for IDB; because of its semantics and implementation, even if you randomized the index order, nothing would crash.)
Flags: needinfo?(m_kato)
See Also: → 1270882, 871846
Since our ICU build option doesn't have break iterator (and regexp) due to dictionary / data size, we cannot still use FTS/ICU although I have landed ICU as default.  (Most reason for size issue is for Fennec.)

Bi-gram indexing is better for multi-lingual. Because, although Chinese and Japanese use Kanji character, dictionary for breaking is different and normalization rule for indexing isn't same.

Also, ICU's break iterator will use dictionary for some languages, so that we have to detect page language correctly at least before using it.
Flags: needinfo?(m_kato)
(In reply to Andrew Sutherland [:asuth] from comment #1)
> Also interesting is whether Thunderbird's custom tokenizer could be
> discarded in favor of an ICU-based one in core.

We should try to converge towards a common solution. Places needs are similar, we need multi-locale tokenizer and folding, but it's a bit simpler cause (for now) we don't plan to index content of the pages, only urls and titles.

> It's possible ICU may be able to do better than
> bi-gram indexing if it has the tables

It can do better, but as Kato-san said, it would need dictionaries. Though, we could probably replace custom code for icu code even for the existing cases.

> I would be personally biased towards not adding an additional dll and
> keeping the extension with the SQLite code (so, in nss3)

I'm just not sure if it'll work when Sqlite is in nss3 but icu is in libxul. The extension requires icu.c, so it may have to stay with icu. But we agree we don't want a new shared library around.

> Collation-wise, note that Gecko's IndexedDB has (Gecko-specific,
> non-standard) prior art on this in bug 871846 which added locale-aware
> indices (see bug 871846 comment 102 onwards for doc links).

Interesting, it's one of the things that would be nice to have with the icu extension indeed. At least from an ORDER BY point of view.

> IDB is
> then able to regenerate the index (without depending on the prior locale
> ordering) if the browser's configured locale changes ("auto" is the
> browser's locale, which can change), or if the ICU library is upgraded and
> locale collation mappings might have been modified.

This is something that Storage should protect against. I'm not sure how yet.

The idea to allow using the ICU extension is mostly for 2 things:
1. experimentations with tokenizers/folding
2. collation sorting

I hope these 2 things are feasible with the current state of ICU, even without full dictionaries.
I honestly don't think in ICU we'll find a ready and final solution, because a solution for multi-locale doesn't really exist yet (I saw some experimentation about neural networks breaking, but it's not ready for products).

What we'll need is a process that involves analysis (I was actually evaluating the CLD2 library to try to have a guess on the string locale), tokenization (we could use tokenizers for the most common western languages and bi-gram for CJK, similarly to Thunderbird), stemming and folding (hopefully we can do these with the current ICU). The result is unlikely to be perfect, but the current urlbar situation (TABLE SCAN) is not better and it's very slooooow.
Blocks: 1340487
Severity: normal → S3
Product: Toolkit → Core
You need to log in before you can comment on or make changes to this bug.