Closed Bug 889890 Opened 11 years ago Closed 11 years ago

[research][discuss] figure out how to improve our l10n situation with search

Categories

(support.mozilla.org :: Search, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED
2013Q3

People

(Reporter: rrosario, Assigned: mythmon)

References

Details

(Whiteboard: u=user c=search p=3 s=2013.14)

We treat all content and queries as English. That's wrong.

This bug is about coming up with a plan to make it better. I don't know all the details, but I know there are l10n-specific index-time things and query-time things.

We should talk to :robhudson to learn what they are planning on the AMO/Marketplace side.

I think the way we want to do this is:
1- Propose some options
2- Discuss amongst ourselves
3- Maybe prototype something
Oops, meant to CC :robhudson, not assign it to him :)
Assignee: robhudson.mozbugs → mcooper
Happy to help and/or point to code.

At least for Marketplace, Elasticsearch doesn't support all the languages that we do. The best we can do in that case is store the languages we support and spit them back out, but use the default analyzer.

For those ES does support, we have something like this:
https://github.com/mozilla/zamboni/blob/master/apps/constants/search.py

And we use that in our Elasticutils indexer...

The mapping:
https://github.com/mozilla/zamboni/blob/master/mkt/webapps/models.py#L1049-L1060

Document extraction:
https://github.com/mozilla/zamboni/blob/master/mkt/webapps/models.py#L1204-L1217

Then when searching, we use the analyzer if the current request locale matches:
https://github.com/mozilla/zamboni/blob/master/mkt/search/views.py#L61-L63
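
Paraphrased, the search side does roughly this (a sketch; the exact names in zamboni differ a bit):

    # Rough sketch of the zamboni approach (names approximate, not the real
    # zamboni code): pick the locale-specific field when the request locale
    # has a known analyzer.
    def get_locale_analyzer(request):
        return SEARCH_ANALYZER_MAP.get(request.LANGUAGE_CODE)

    analyzer = get_locale_analyzer(request)
    if analyzer:
        field = 'name_%s' % analyzer  # e.g. name_spanish
    else:
        field = 'name'  # the default-analyzed field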

The alternate way to handle this is to store completely separate docs with a field designating which locale each doc is in, filter for those docs, and use the appropriate analyzers.
We have separate docs already so we probably want to go with the alternate way.
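
Something like this, maybe (untested sketch; assumes our existing document_locale field and a locale-to-analyzer mapping):

    # Untested sketch: restrict results to docs in the user's locale and
    # analyze the query text with that locale's analyzer.
    query = {
        'query': {
            'filtered': {
                'query': {
                    'match': {
                        'document_content': {
                            'query': search_text,
                            'analyzer': locale_analyzers.get(locale, 'snowball'),
                        },
                    },
                },
                'filter': {'term': {'document_locale': locale}},
            },
        },
    }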

Can you specify the analyzer to use for a given field at indexing time?
I was told by one of the trainers that yes, you can choose which analyzer to use per field at index time, and this was one way they suggested doing multi-language things when you know what the language is. I don't yet know how to do this; I need to do more research.

Note that we also need to change the analyzer that is used at search time for the user's input, so that it matches whatever language they are searching in.
Status: NEW → ASSIGNED
Proposal: Use a single field for document content, and index that field differently depending on the locale of the document. When performing searches, take that into account.

-----

In our current mapping, fields are analyzed in one of three ways: as numbers, as "not_analyzed", or with snowball. Snowball is pretty great for some languages, but doesn't handle languages like Chinese, which is what we are trying to improve here. I am assuming that numbers do not use the default analyzer, but instead have an implicit analyzer just for numbers. I need to look into this more carefully to be sure.

Elasticsearch has the ability to change the default analyzer on a per-document basis by keying off a field in the document. What I propose is that every time we index a wiki document, we also choose an analyzer to use as the default analyzer for that document, and change the fields that currently use snowball to fall back to that default.

This means that there is only one field of each kind (body, title, etc) to search on, unlike AMO's solution. We still have to keep a mapping of locale codes to ES analyzers, like AMO, and we will have to take the varying analyzers into account while searching.

When a user submits a search, ES analyzes that too in the same way it analyzes documents (though the analyzer for search and for indexing can be different). This way ES is comparing apples to apples when doing search. In order to accommodate the variable analyzer for the document above, we need to vary the analyzer on the searches as well, depending on what language the user is searching in. If we guess the wrong language for the user's input, they are (likely) no worse off than they are right now, and if we guess correctly, they should get better search results.


Since words are cheap and ambiguous, here is some code:

Here is the new mapping for wiki.models.Document

    @classmethod
    def get_mapping(cls):
        return {
            'properties': {
                # These have had "analyzer: snowball" removed.
                'document_summary': {'type': 'string'},
                'document_keywords': {'type': 'string'},
                'document_title': {'type': 'string', 'analyzer': 'snowball'},
                'document_content': {'type': 'string', 'analyzer': 'snowball',
                                     'store': 'yes',
                                     'term_vector': 'with_positions_offsets'},

                # unmodified properties omitted.
            }
        }

A mapping from locale to analyzer would be made, likely stolen from AMO:

    ES_LOCALE_ANALYZERS = {
        'en-US': 'snowball',
        'es': 'spanish',
        'zh-CN': 'chinese',
        ...,
    }

The new indexing code would key into the above mapping and set appropriate fields on the document:

    def extract_document(cls, obj_id, obj=None):
        ...
        d['_analyzer'] = settings.ES_LOCALE_ANALYZERS.get(obj.locale, 'snowball')
        ...

NB: _analyzer is the default name of the field that controls per-document analyzer selection. We could customize this, but I don't really see why we need to.
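
For completeness, the mapping can point this at a different field via a "path" setting. A sketch (type name hypothetical), though note the referenced field has to contain analyzer names, not locale codes, so the default seems fine for us:

    # Sketch only: read the per-document analyzer from a custom field
    # instead of _analyzer. We probably don't need this.
    'wiki_document': {
        '_analyzer': {'path': 'document_analyzer'},
        'properties': {
            ...,
        },
    }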

The new search code would have to specify what analyzer to use, based on language, as well. This is done by setting "analyzer" to the desired analyzer on the search query. I'm a little fuzzy on exactly how this is done, but it does work. I don't know if elasticutils or pyes support this exact use case yet, so that will need to be figured out.
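
My best guess at what the raw query body would look like (untested):

    # Untested sketch: set the analyzer explicitly on a match query so it
    # matches the per-document analyzer chosen at index time.
    search_body = {
        'query': {
            'match': {
                'document_content': {
                    'query': user_query,
                    'analyzer': settings.ES_LOCALE_ANALYZERS.get(
                        request.LANGUAGE_CODE, 'snowball'),
                },
            },
        },
    }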

---

I'm pretty sure this will work well for us. It has the downside of being different from what Marketplace is doing, but I think it is a better solution because it doesn't involve having one field per locale for each of title, body, keywords, and summary. Thoughts?
:willkg asked me some questions in IRC regarding the above, so I wanted to reproduce the answers here:

0. My note about numbers and analyzers reveals that I don't know a lot about how ES indexes things. In reality numbers are not analyzed at all. This is good for this proposal though, and validates my assumption even though my assumption was reached by poor reasoning.

1. In the new mapping proposed, I forgot to remove "'analyzer': 'snowball'" from document_title and document_content. This was a mistake, they should have that removed.

2. The point of removing the analyzer in document_{summary,keywords,title,content} was so that they use the default analyzer for the document, which is set at index time.

3. Does our version of ES support all of the above? I think it does, but I am going to work on getting a test script that will verify that.
Here is an example script that indexes three documents, one each in English, Spanish, and Russian, asks ES how they were analyzed, and then runs some searches. Notice that all three languages are getting stemmed, and that the stemming works in searches too.

----

echo "Deleting the test index"
curl -XDELETE localhost:9200/test-idx
echo
echo "Creating the test index"
curl -XPUT localhost:9200/test-idx -d '{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}'
echo
echo "Setting mapping"
curl -XPUT localhost:9200/test-idx/content/_mapping -d '
{
  "content": {
    "properties": {
      "body": {
        "type": "string",
        "store": "yes",
        "index": "analyzed",
      },
    }
  }
}'

echo
echo "----------------------------------------------------------------------"
echo "Indexing an English document"
curl -XPUT localhost:9200/test-idx/content/1 -d '{
  "body": "Setting your home page in Firefox is easy.",
  "_analyzer": "snowball"
}'

echo "Indexing a Spanish document"
curl -XPUT localhost:9200/test-idx/content/2 -d '{
  "body": "Establecer la página de inicio en Firefox es fácil.",
  "_analyzer": "spanish"
}'

echo "Indexing a Russian document"
curl -XPUT localhost:9200/test-idx/content/3 -d '{
  "body": "Установить домашнюю страницу в Firefox очень легко.",
  "_analyzer": "russian"
}'

curl -XPOST localhost:9200/test-idx/_refresh

echo
echo "----------------------------------------------------------------------"
echo "List of docs"
curl -XGET localhost:9200/test-idx/_search\?pretty=true -d '{
 "query": {
  "query_string": {
   "query": "_id:1"
  }
 },
 "facets": {
  "terms1": {
   "terms": {
    "field": "body"
   }
  }
 }
}'
echo

curl -XGET localhost:9200/test-idx/_search\?pretty=true -d '{
 "query": {
  "query_string": {
   "query": "_id:2"
  }
 },
 "facets": {
  "terms1": {
   "terms": {
    "field": "body"
   }
  }
 }
}'
echo

curl -XGET localhost:9200/test-idx/_search\?pretty=true -d '{
 "query": {
  "query_string": {
   "query": "_id:3"
  }
 },
 "facets": {
  "terms1": {
   "terms": {
    "field": "body"
   }
  }
 }
}'
echo

echo
echo "----------------------------------------------------------------------"
echo "Ok, lets make some searches"

echo "Russian, search for домашню (which is less than a complete word in the document)"
curl -XGET localhost:9200/test-idx/_search?pretty=true -d '{
    "query": {
        "match": {
            "body": {
                "query": "домашню",
                "analyzer": "russian"
            }
        }
    }
}'
echo
echo "Spanish, searching for 'paginas'"

curl -XGET localhost:9200/test-idx/_search?pretty=true -d '{
    "query": {
        "match": {
            "body": {
                "query": "paginas",
                "analyzer": "spanish"
            }
        }
    }
}'
echo
echo "English, seraching for 'home page'"

curl -XGET localhost:9200/test-idx/_search?pretty=true -d '{
    "query": {
        "match": {
            "body": {
                "query": "set home page",
                "analyzer": "english"
            }
        }
    }
}'
echo

---

And the output of the above:

Deleting the test index
{"ok":true,"acknowledged":true}
Creating the test index
{"ok":true,"acknowledged":true}
Setting mapping
{"error":"MapperParsingException[Failed to parse mapping definition]; nested: JsonParseException[Unexpected character ('}' (code 125)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name\n at [Source: \n{\n  \"content\": {\n    \"properties\": {\n      \"body\": {\n        \"type\": \"string\",\n        \"store\": \"yes\",\n        \"index\": \"analyzed\",\n      },\n    }\n  }\n}; line: 9, column: 8]]; ","status":400}
----------------------------------------------------------------------
Indexing an English document
{"ok":true,"_index":"test-idx","_type":"content","_id":"1","_version":1}Indexing a Spanish document
{"ok":true,"_index":"test-idx","_type":"content","_id":"2","_version":1}Indexing a Russian document
{"ok":true,"_index":"test-idx","_type":"content","_id":"3","_version":1}{"ok":true,"_shards":{"total":1,"successful":1,"failed":0}}
----------------------------------------------------------------------
List of docs
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "1",
      "_score" : 1.0, "_source" : {
  "body": "Setting your home page in Firefox is easy.",
  "_analyzer": "snowball"
}
    } ]
  },
  "facets" : {
    "terms1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 6,
      "other" : 0,
      "terms" : [ {
        "term" : "your",
        "count" : 1
      }, {
        "term" : "set",
        "count" : 1
      }, {
        "term" : "page",
        "count" : 1
      }, {
        "term" : "home",
        "count" : 1
      }, {
        "term" : "firefox",
        "count" : 1
      }, {
        "term" : "easi",
        "count" : 1
      } ]
    }
  }
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "2",
      "_score" : 1.0, "_source" : {
  "body": "Establecer la página de inicio en Firefox es fácil.",
  "_analyzer": "spanish"
}
    } ]
  },
  "facets" : {
    "terms1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 5,
      "other" : 0,
      "terms" : [ {
        "term" : "pagin",
        "count" : 1
      }, {
        "term" : "inici",
        "count" : 1
      }, {
        "term" : "firefox",
        "count" : 1
      }, {
        "term" : "facil",
        "count" : 1
      }, {
        "term" : "establecer",
        "count" : 1
      } ]
    }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "3",
      "_score" : 1.0, "_source" : {
  "body": "Установить домашнюю страницу в Firefox очень легко.",
  "_analyzer": "russian"
}
    } ]
  },
  "facets" : {
    "terms1" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 6,
      "other" : 0,
      "terms" : [ {
        "term" : "установ",
        "count" : 1
      }, {
        "term" : "страниц",
        "count" : 1
      }, {
        "term" : "очен",
        "count" : 1
      }, {
        "term" : "легк",
        "count" : 1
      }, {
        "term" : "домашн",
        "count" : 1
      }, {
        "term" : "firefox",
        "count" : 1
      } ]
    }
  }
}

----------------------------------------------------------------------
Ok, let's make some searches
Russian, search for домашню (which is less than a complete word in the document)
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5270494,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "3",
      "_score" : 0.5270494, "_source" : {
  "body": "Установить домашнюю страницу в Firefox очень легко.",
  "_analyzer": "russian"
}
    } ]
  }
}
Spanish, searching for 'paginas'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.614891,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "2",
      "_score" : 0.614891, "_source" : {
  "body": "Establecer la página de inicio en Firefox es fácil.",
  "_analyzer": "spanish"
}
    } ]
  }
}
English, searching for 'home page'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.91287637,
    "hits" : [ {
      "_index" : "test-idx",
      "_type" : "content",
      "_id" : "1",
      "_score" : 0.91287637, "_source" : {
  "body": "Setting your home page in Firefox is easy.",
  "_analyzer": "snowball"
}
    } ]
  }
}
Oh, oops. I just noticed that the mapping section of the above example failed due to JSON errors (trailing commas). That just goes to show how well ES copes even without an explicit mapping. Fixing the JSON error and rerunning gives the same results as above.
I went through all the languages we support for SUMO, learned a little about each one, and picked analyzers I think would be good for them. These fall into 3 categories:

1) Languages with a built-in analyzer that works well or can be configured to work well.
2) Languages which have interesting features, and so should use a special analyzer, but for which there is no specific built-in analyzer (Japanese, Korean, Polish, Burmese).
3) Languages which have no specific analyzer, and for which the standard analyzer should work well.

Built in
========
Snowball
---------
Snowball has a setting that adapts it to various languages. Where available, I
think this is the best option. These languages are supported by snowball:

* Armenian (hy-AM)
* Basque (eu)
* Catalan (ca)
* Danish (da)
* Dutch (nl)
* English (en)
* Finnish (fi)
* French (fr)
* German (de)
* Hungarian (hu)
* Italian (it)
* Norwegian (Bokmål) (nb-NO)
* Norwegian (Nynorsk) (no)
* Portuguese (Brazilian) (pt-BR)
* Portuguese (Portugal) (pt-PT)
* Romanian (ro)
* Russian (ru)
* Spanish (es)
* Swedish (sv)
* Turkish (tr)

NB: All of those languages are supported by the lang analyzers below, but I
think that snowball is a better analyzer than the default.

Language Analyzers
------------------
ES has a pack of analyzers it calls "lang analyzers"
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/).
These languages are covered by them:

* Arabic (ar)
* Bulgarian (bg)
* Chinese (Simplified) (zh-CN)
* Chinese (Traditional) (zh-TW)
* Czech (cs)
* Galician (gl)
* Greek (el)
* Hindi (India) (hi-IN)
* Indonesian (id)
* Persian (fa)
* Thai (th)


With Plugin
===========
* Polish (pl) - A plugin is available to add a Polish analyzer.
  https://github.com/elasticsearch/elasticsearch-analysis-stempel/tree/master/src/main


Special Consideration
=====================
* Burmese (my) - Has no spaces between words, and the CJK analyzer doesn't
  work. We need to build a custom analyzer (see the sketch after this list).
* Japanese (ja) - Has no spaces, but CJK works fine; use that.
* Korean (ko) - Intuition says that this falls under the CJK analyzer, but I
  think the standard tokenizer would work better, because modern Korean uses
  only Hangul characters with spaces between words.
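
For the custom cases, the analyzer would be defined in the index settings. A
sketch of what that might look like (the Korean one just wraps the standard
tokenizer; what Burmese needs is the open research question):

    # Sketch: custom analyzers declared in the index settings. These are
    # assumptions to research, not settled choices.
    ES_INDEX_ANALYSIS = {
        'analysis': {
            'analyzer': {
                'korean': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': ['lowercase'],
                },
            },
        },
    }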


Other
=====
These don't have snowball analyzers, and don't have ES language analyzers that I
know of. However, the "standard" analyzer should work fine, since these are all
languages that use spaces as separators between words. We won't get stemming or
stop words, but at least search shouldn't be worse than what we have now.

* Acholi (ach)
* Akan (ak)
* Albanian (sq)
* Assamese (as)
* Asturian (ast)
* Azerbaijani (az)
* Belarusian (be)
* Bengali (Bangladesh) (bn-BD)
* Bengali (India) (bn-IN)
* Bosnian (bs)
* Croatian (hr)
* Esperanto (eo)
* Estonian (et)
* Frisian (fy-NL)
* Friulian (fur)
* Fulah (ff)
* Gujarati (gu-IN)
* Hebrew (he)
* Icelandic (is)
* Iloko (ilo)
* Irish (Ireland) (ga-IE)
* Kannada (kn)
* Kazakh (kk)
* Khmer (km)
* Kinyarwanda (rw)
* Lithuanian (lt)
* Luganda (lg)
* Macedonian (mk)
* Maithili (mai)
* Malayalam (ml)
* Malay (ms)
* Marathi (mr)
* Mongolian (mn)
* Nepali (ne-NP)
* Northern Sotho (nso)
* Punjabi (pa-IN)
* Romansh (rm)
* Sakha (sah)
* Scottish Gaelic (gd)
* Serbian (sr-Cyrl)
* Serbian (sr-Latn)
* Sinhala (si)
* Slovak (sk)
* Slovenian (sl)
* Songhay (son)
* Swahili (sw)
* Tamil (Sri Lanka) (ta-LK)
* Tamil (ta)
* Telugu (te)
* Ukrainian (uk)
* Vietnamese (vi)
* Zulu (zu)


Again, words are cheap, and code is less ambiguous, so:

    ES_LOCALE_ANALYZER_MAP = {
        'ar': 'arabic',
        'hy-AM': 'snowball-armenian',
        'eu': 'snowball-basque',
        'bg': 'bulgarian',
        'my': 'custom',
        'ca': 'snowball-catalan',
        'zh-CN': 'chinese',
        'zh-TW': 'chinese',
        'cs': 'czech',
        'da': 'snowball-danish',
        'nl': 'snowball-dutch',
        'en': 'snowball-english',
        'fi': 'snowball-finnish',
        'fr': 'snowball-french',
        'gl': 'galician',
        'de': 'snowball-german',
        'hi-IN': 'hindi',
        'hu': 'snowball-hungarian',
        'id': 'indonesian',
        'it': 'snowball-italian',
        'ja': 'cjk',
        'nb-NO': 'snowball-norwegian',
        'no': 'snowball-norwegian',
        'fa': 'persian',
        'pl': 'polish',
        'pt-BR': 'snowball-portuguese',
        'pt-PT': 'snowball-portuguese',
        'ro': 'snowball-romanian',
        'ru': 'snowball-russian',
        'es': 'snowball-spanish',
        'sv': 'snowball-swedish',
        'th': 'thai',
        'tr': 'snowball-turkish',
    }
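
Everything in the "Other" list above would fall back to the standard analyzer, something like:

    # Sketch: anything not in the map falls back to the standard analyzer.
    def es_analyzer_for_locale(locale):
        return ES_LOCALE_ANALYZER_MAP.get(locale, 'standard')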
Blocks: 778437
I'm not caught up on comments yet, but bug 778437 is related to this.
I like this proposal. I think we're in good shape with the languages supported.

I have some questions on how things will work.

* I assume you'll be doing the same thing for questions? We don't do any locale filtering when searching questions (the content is only en-US and pt-BR for now, but we show them in results for all locales). Not sure if we need to do anything or if it will even be affected. There is also bug 885092, so we might not even include questions in the search results for locales we don't support in the Support Forum. In that case, we don't have to do anything different from wiki.

* What happens when you query with a different analyzer than was specified in the document when indexing? Can you still search across the entire index even if the analyzer doesn't match? Does it affect the scores? What happens if I query with `"query": "set home page", "analyzer": "spanish"`?

That's all for now. This is exciting!
Regarding Ricky's questions:

1. I think we should start with the wiki, where we know the language setting for documents is correct. We can do questions and other things in a different bug.

2. The example you gave is what we're doing right now, except in reverse--queries that are in Spanish are being analyzed with a Snowball English analyzer, which does a few things, some of which work (tokenizing on spaces, dropping some punctuation, ...) and some of which don't really do much (stemming using English rules, English stop words, ...).
I think the research here is really good and I can't think of anything that isn't covered here already.

I think it's worth spinning off an implementation bug that covers the changes that need to be made.
(In reply to Will Kahn-Greene [:willkg] from comment #13)
> I think it's worth spinning off an implementation bug that covers the
> changes that need to be made.

+1 bd \o/
No longer blocks: 778437
Filed bug 894686 (implementation), and bug 894649 (IT bug for the Polish plugin) to get this done next sprint.

\o/
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 793762, 708889