Closed Bug 708889 Opened 14 years ago Closed 12 years ago

Take advantage of Elastic's support for non-English languages

Categories

(support.mozilla.org :: Search, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 889890

People

(Reporter: erik, Unassigned)

A few of willkg's musings from https://etherpad.mozilla.org/sumo-elastic:

Zamboni maps locale codes (e.g. 'af', 'ar', 'bg', 'en-us', ...) to language names used by Elastic (e.g. 'arabic', 'bulgarian', ...). There are locale codes for which there is no Elastic support. These are loosely marked in apps/constants/search.py.

In apps/addons/search.py, zamboni has a function setup_mapping() which has a section with the comment "Add room for language-specific indexes". Looks like they have a description_ field for each language and stick the value in the right place when indexing. (FIXME: check that)

In sumo, questions and forums are all English, so they're easy. The wiki has language-specific content. Three possibilities:

1. Make language-specific indexes?
2. Make language-specific types?
3. Add a bunch of language-specific fields?

FIXME: what to do here?
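To make the third possibility concrete, here is a minimal sketch of a locale-to-analyzer table and a helper that builds one mapping field per supported locale. The names LOCALE_TO_ANALYZER, analyzer_for_locale, language_field_mapping and the content_<locale> field naming are all hypothetical; they are not taken from zamboni or kitsune, and the table only lists a few locales for illustration.

```python
# Hypothetical table: locale code -> Elasticsearch built-in language analyzer.
# Locales that are missing here are treated as "no Elastic support".
LOCALE_TO_ANALYZER = {
    'ar': 'arabic',
    'bg': 'bulgarian',
    'de': 'german',
    'en-us': 'english',
    'fr': 'french',
}


def analyzer_for_locale(locale):
    """Return the analyzer name for a locale, or None if it is not in the table."""
    return LOCALE_TO_ANALYZER.get(locale.lower())


def language_field_mapping(base_name, locales):
    """Build mapping properties with one analyzed field per supported locale.

    Field names look like 'content_fr' or 'content_en_us' (an assumed
    naming scheme). Unsupported locales are skipped.
    """
    properties = {}
    for locale in locales:
        analyzer = analyzer_for_locale(locale)
        if analyzer is None:
            continue
        field = '%s_%s' % (base_name, locale.lower().replace('-', '_'))
        # 'string' was the field type in Elasticsearch releases of that era;
        # newer releases use 'text' instead.
        properties[field] = {'type': 'string', 'analyzer': analyzer}
    return properties
```

At index time, a wiki document would then put its text into the field matching its own locale and leave the other language fields empty.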
We will need some custom analyzers to support non-English stemmers. If we go with the single-index approach, we can do something like this in search.es_utils to define those analyzers:

    # Add a custom analyzer which strips HTML so we don't turn up all the <br> tags
    # when searching for "br":
    INDEX_SETTINGS = {
        'index': {
            'analysis': {
                'analyzer': {
                    # Like the snowball analyzer but with html_strip:
                    'snowballHtml': {
                        'type': 'custom',
                        'tokenizer': 'standard',
                        'filter': ['standard', 'lowercase', 'stop', 'snowball'],
                        'char_filter': ['html_strip']
                    }
                }
            }
        }
    }
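One way to extend that idea to non-English stemmers is to generate one custom analyzer per language, each pointing at a snowball token filter configured for that language. The sketch below is an assumption about how this could look, not code from search.es_utils: build_analysis_settings, SNOWBALL_LANGUAGES and the snowballHtml_<language> naming are made up for illustration. It leaves out the 'stop' filter because the default stop word list is English; a per-language stop filter would need to be added separately.

```python
# Hypothetical table: analyzer suffix -> value for the snowball filter's
# 'language' setting. Only a few languages are listed as an example.
SNOWBALL_LANGUAGES = {
    'french': 'French',
    'german': 'German',
    'spanish': 'Spanish',
}


def build_analysis_settings(languages):
    """Return index settings with one HTML-stripping snowball analyzer per language."""
    filters = {}
    analyzers = {}
    for name in languages:
        filter_name = 'snowball_%s' % name
        # Language-specific stemmer, defined as a named token filter.
        filters[filter_name] = {
            'type': 'snowball',
            'language': SNOWBALL_LANGUAGES[name],
        }
        # Same shape as the snowballHtml analyzer, but with the
        # language-specific stemmer swapped in.
        analyzers['snowballHtml_%s' % name] = {
            'type': 'custom',
            'tokenizer': 'standard',
            'filter': ['lowercase', filter_name],
            'char_filter': ['html_strip'],
        }
    return {'index': {'analysis': {'filter': filters, 'analyzer': analyzers}}}


INDEX_SETTINGS = build_analysis_settings(['french', 'german'])
```

Generating the settings from a table keeps the analyzer definitions in one place, so adding a language is a one-line change rather than another hand-written block.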
As a side note, I'm currently voting for language-specific fields. Then we tweak the query_fields to be the English field + the locale-specific field for the query. Language-specific indexes prevent us from doing an "English + some other locale" query, which is what I think we want to do. Language-specific types might be complicated since doctypes are mapped to Django ORM models.
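A rough sketch of the "English field + locale-specific field" query, assuming one content_<locale> field per language. The function names, the field naming and the choice of a multi_match query are all assumptions for illustration; the real query_fields handling in sumo may look different.

```python
def field_for_locale(locale, base_name='content'):
    """Hypothetical naming scheme: 'en-US' -> 'content_en_us'."""
    return '%s_%s' % (base_name, locale.lower().replace('-', '_'))


def query_fields_for_locale(locale, default_locale='en-US'):
    """Return the English field, plus the locale's own field if it differs."""
    fields = [field_for_locale(default_locale)]
    if locale.lower() != default_locale.lower():
        fields.append(field_for_locale(locale))
    return fields


def build_query(text, locale):
    """Build a query body that searches both fields in a single index."""
    return {
        'query': {
            'multi_match': {
                'query': text,
                'fields': query_fields_for_locale(locale),
            }
        }
    }
```

For a French user, build_query('marque-pages', 'fr') searches content_en_us and content_fr together, which is the combination that separate per-language indexes would make awkward.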
Blocks: 889890
This is essentially the same thing we're doing in bug #889890, but we're taking a different approach there. Marking this as a dupe of that one. Mike: Might be worth skimming the comments here, but your approach is better.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE