Closed
Bug 708889
Opened 14 years ago
Closed 12 years ago
Take advantage of Elastic's support for non-English languages
Categories
(support.mozilla.org :: Search, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 889890
People
(Reporter: erik, Unassigned)
References
Details
Snowball, for example, might support a variety of languages: http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer.html
Comment 1•14 years ago (Reporter)
A few of willkg's musings from https://etherpad.mozilla.org/sumo-elastic:
zamboni maps locale codes (e.g. 'af', 'ar', 'bg', 'en-us', ...) to language
names used by elastic (e.g. 'arabic', 'bulgarian', ...). there are locale
codes for which there is no elastic support. these are loosely marked in
apps/constants/search.py.
in apps/addons/search.py, zamboni has a function setup_mapping() which
has a section that has the comment "Add room for language-specific indexes".
looks like they have a description_ field for each language and stick the
value in the right place when indexing. (FIXME: check that)
in sumo, questions and forums are all english, so they're easy.
the wiki has language-specific content. three possibilities:
1. make language-specific indexes?
2. make language-specific types?
3. add a bunch of language-specific fields?
FIXME: what to do here?
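A minimal sketch of option 3, assuming a locale-to-analyzer table like the one described for zamboni. The table entries, the None marker for unsupported locales, the helper names, and the 'content' base field are all hypothetical and not taken from zamboni or sumo. The 'string' field type matches older Elastic releases; newer ones use 'text'.

```python
# Illustrative subset only. The real table is said to live in
# apps/constants/search.py; these entries are assumptions.
LOCALE_TO_ANALYZER = {
    'ar': 'arabic',
    'bg': 'bulgarian',
    'en-us': 'english',
    'af': None,  # assumed here to have no language-specific analyzer
}


def analyzer_for(locale, fallback='standard'):
    """Return the analyzer name for a locale, or the fallback if unsupported."""
    return LOCALE_TO_ANALYZER.get(locale.lower()) or fallback


def field_name(base, locale):
    """Build a per-locale field name, e.g. ('content', 'en-us') -> 'content_en_us'."""
    return '%s_%s' % (base, locale.lower().replace('-', '_'))


def language_field_mapping(base, locales, field_type='string'):
    """Build one mapping property per locale for a single base field."""
    return {
        field_name(base, locale): {
            'type': field_type,
            'analyzer': analyzer_for(locale),
        }
        for locale in locales
    }
```

With this shape, a locale without Elastic support still gets its own field but is analyzed with the fallback analyzer, so indexing does not need a special case for it.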
Comment 2•14 years ago (Reporter)
We will need some custom analyzers to support non-English stemmers. If we go with the single-index approach, we can do something like this in search.es_utils to define those analyzers:
# Add a custom analyzer which strips HTML so we don't turn up all the <br> tags
# when searching for "br":
INDEX_SETTINGS = {
    'index': {
        'analysis': {
            'analyzer': {
                # Like the snowball analyzer but with html_strip:
                'snowballHtml': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': ['standard', 'lowercase', 'stop', 'snowball'],
                    'char_filter': ['html_strip'],
                },
            },
        },
    },
}
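One way the non-English variants might be generated, as a sketch only: it defines one snowball token filter per language and one html_strip analyzer that uses it. The language table, the build_analysis helper, and the snowball_<code> and snowballHtml_<code> names are hypothetical. The 'stop' filter is left out because its default stopword list is English, and the 'standard' token filter is assumed to be available, as in the settings above.

```python
# Illustrative subset; the values are language names for the snowball filter.
SNOWBALL_LANGUAGES = {
    'de': 'German',
    'es': 'Spanish',
    'fr': 'French',
}


def build_analysis(languages):
    """Build index settings with one custom analyzer per language."""
    analyzers = {}
    filters = {}
    for code, name in sorted(languages.items()):
        filter_name = 'snowball_%s' % code
        # A named snowball token filter configured for this language.
        filters[filter_name] = {'type': 'snowball', 'language': name}
        # Same shape as snowballHtml, but using the language-specific stemmer.
        analyzers['snowballHtml_%s' % code] = {
            'type': 'custom',
            'tokenizer': 'standard',
            'filter': ['standard', 'lowercase', filter_name],
            'char_filter': ['html_strip'],
        }
    return {'index': {'analysis': {'analyzer': analyzers, 'filter': filters}}}


LANGUAGE_INDEX_SETTINGS = build_analysis(SNOWBALL_LANGUAGES)
```

Generating the entries from a table keeps the per-language analyzers identical apart from the stemmer, which makes it easier to add or drop a language later.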
Comment 3•14 years ago
As a side note, I'm currently voting for language-specific fields. Then we tweak the query_fields to be the English field + the locale-specific field for the query.
Language-specific indexes prevent us from doing an "English + some other locale" query, which is what I think we want to do.
Language-specific types might be complicated since doctypes are mapped to Django ORM models.
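A rough sketch of how query_fields could be assembled under the language-specific-fields approach. The query_fields_for helper, the 'title' and 'content' base fields, and the <base>_<locale> naming scheme are assumptions for illustration, not existing sumo code.

```python
def query_fields_for(locale, base_fields=('title', 'content'), english='en-us'):
    """Return the English fields plus the fields for the requested locale.

    Duplicates are skipped, so an English query only lists the English fields.
    """
    fields = []
    for loc in (english, locale):
        suffix = loc.lower().replace('-', '_')
        for base in base_fields:
            name = '%s_%s' % (base, suffix)
            if name not in fields:
                fields.append(name)
    return fields
```

For example, query_fields_for('de') would list the English fields first and the German fields after them, which is the "English + some other locale" query described above.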
Comment 5•12 years ago
This is essentially the same thing we're doing in bug #889890, but we're using a different solution there.
Marking this as a dupe for that one.
Mike: Might be worth skimming the comments here, but your approach is better.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE