635837 - Improve search quality for queries including non-ascii characters

I've investigated in further detail, and found some issues in zamboni and elasticsearch. In zamboni, - snowball and standardPlusWordDelimiter are used for index analyzer, but those analyzers should be language analyzer[1] depends on localized text. - standard analyzer is used for query string, but it also should be language analyzer depends on input text. [1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html In elastic search, - there is a bug 'cjk' language analyzer. Though I've sent a pull-request to upstream[2] for this issue, I'd recommend apply the fix to elasticsearch on AMO sever if the elasticsearch on the server is built by Mozilla. [2] https://github.com/elasticsearch/elasticsearch/pull/1415

Hiroyuki Ikezoe (:hiro)

Assignee

Comment 4

•

14 years ago

Attached patch Proposed fix (obsolete) — Details — Splinter Review

Creating indices for each language and use language specific analyzer for input text based on locale (It is hard to detect the language depends on input text itself).

Wil Clouser [:clouserw]

Updated

•

14 years ago

Attachment #568583 - Flags: review?(jbalogh)

Jeff Balogh (:jbalogh)

Comment 5

•

14 years ago

Comment on attachment 568583 [details] [diff] [review] Proposed fix First of all, thanks for taking a look at this! Can you provide some examples where the current analyzers are failing? The analyzer is biased towards english, but I don't think the stemming is very important since most people search for addons by name (according to the logs). It would help to have a good collection of examples when people go to change how the search engine works. A couple nits to get out of the way: please follow pep8 (particularly, keeping lines under 80 characters) and use more descriptive variable names. You're using _ as a variable name in the following block and it's hard to tell what that's actually referring to. >+ >+ # Indices for each language >+ for analyzer in ANALYZER_LIST: >+ languages = [l for l, a in ANALYZER_MAP.iteritems() if a == analyzer] It looks like you actually want ANALYZER_MAP to be a mapping from language => [language codes], so you'd have (for example) {'cjk': ['ko', 'ja']}. >+ d['name_' + analyzer] = list(set(string for _, string >+ in translations[addon.name_id] if _.lower() in languages)) Can you use "locale" instead of "_" as the variable name? >+ for analyzer in ANALYZER_LIST: >+ mapping['properties']['name_' + analyzer] = { >+ 'type': 'string', >+ 'analyzer': analyzer, >+ } I don't see these analyzers in http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Where do they come from? >--- a/apps/search/__init__.py >+++ b/apps/search/__init__.py >@@ -0,0 +1,40 @@ Can you add comments on where the keys and values come from and how it's supposed to be used? And why are some rows commented out? >+ANALYZER_MAP = { >+# 'af': '', # Afrikaans >+ 'ar': 'arabic', >+ 'bg': 'bulgarian', >+ >+ANALYZER_LIST = set(analyzer for analyzer in ANALYZER_MAP.values()) set(ANALYZER_MAP.values()) does the same thing. >diff --git a/apps/search/views.py b/apps/search/views.py >index e7c5681..a01fffc 100644 >--- a/apps/search/views.py >+++ b/apps/search/views.py >@@ -4,6 +4,7 @@ from django.conf import settings > from django.db.models import Q > from django.shortcuts import redirect > from django.utils.encoding import smart_str >+from django.utils import translation > > import commonware.log > import jingo >@@ -11,6 +12,7 @@ from tower import ugettext as _ > from mobility.decorators import mobile_template > > import amo >+from search import ANALYZER_MAP > import bandwagon.views > import browse.views > from addons.models import Addon, Category >@@ -374,21 +376,43 @@ def ajax_search_suggestions(request): > > return results > >+def _get_analyzer(): >+ language = translation.get_language() >+ if language in ANALYZER_MAP.keys(): >+ return ANALYZER_MAP[language] >+ return 'standard' This can be written as `ANALYZER_MAP.get(language, 'standard')`. > > def name_only_query(q): >- return dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'}, >- name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4}, >- name__startswith={'value': q, 'boost': 1.5}) >+ d = dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'}, >+ name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4}, >+ name__startswith={'value': q, 'boost': 1.5}) >+ >+ analyzer = _get_analyzer() >+ d['name_' + analyzer + '__text'] = {'query': q, 'boost': 2.5, 'analyzer': analyzer} >+ return d If this returns "standard" we shouldn't be adding another the extra (redundant) query. >+ >+ analyzer = _get_analyzer() >+ more['summary_' + analyzer + '__text'] = {'query': q, >+ 'boost': 0.6, >+ 'type': 'phase', Should this be "phrase"? >1.7.4.1 > How can we test that this works and that we don't regress it? You don't have to write the tests, but some examples that we can turn into test cases would be great.

Jeff Balogh (:jbalogh)

Updated

•

14 years ago

Assignee: jbalogh → hiikezoe

Hiroyuki Ikezoe (:hiro)

Assignee

Comment 6

•

14 years ago

(In reply to Jeff Balogh (:jbalogh) from comment #5) > Comment on attachment 568583 [details] [diff] [review] [diff] [details] [review] > Proposed fix > > First of all, thanks for taking a look at this! Can you provide some > examples where the current analyzers are failing? The analyzer is biased > towards english, but I don't think the stemming is very important since most > people search for addons by name (according to the logs). It would help to > have a good collection of examples when people go to change how the search > engine works. A typical failure case is collocation strings with KATAKANA words. For example, "テキストリンク" (it means "text link" in English) consists of "テキスト" and "リンク", but the current analyzer stores these strings as a word in index tables. So "テキストリンク" can be searched but "リンク" can not be searched. Stemming is not a problem at least in the cases of CJK here. > A couple nits to get out of the way: please follow pep8 (particularly, > keeping lines under 80 characters) and use more descriptive variable names. > You're using _ as a variable name in the following block and it's hard to > tell what that's actually referring to. Yes, you are right. I actually love descriptive name. > >+ > >+ # Indices for each language > >+ for analyzer in ANALYZER_LIST: > >+ languages = [l for l, a in ANALYZER_MAP.iteritems() if a == analyzer] > > It looks like you actually want ANALYZER_MAP to be a mapping from language > => [language codes], so you'd have (for example) {'cjk': ['ko', 'ja']}. We need two mappings. One is the current ANALYZER_MAP used in apps/search/view.py. The other is exactly the mapping you are mentioning. You mean the other mapping should also be created in apps/search/__init__py? I do not know how to create the mapping smart in Python. > >+ d['name_' + analyzer] = list(set(string for _, string > >+ in translations[addon.name_id] if _.lower() in languages)) > > Can you use "locale" instead of "_" as the variable name? Yes, sure. > >+ for analyzer in ANALYZER_LIST: > >+ mapping['properties']['name_' + analyzer] = { > >+ 'type': 'string', > >+ 'analyzer': analyzer, > >+ } > > I don't see these analyzers in > http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Where > do they come from? Yes, the document is little bit confusable. I had also misunderstood the language analyzer first. 'language' is actually not a valid analyzer name, each language analyzer name (listed in [1]) has to be set. [1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html > >--- a/apps/search/__init__.py > >+++ b/apps/search/__init__.py > >@@ -0,0 +1,40 @@ > > Can you add comments on where the keys and values come from and how it's > supposed to be used? And why are some rows commented out? Yes, I will. > >+ANALYZER_MAP = { > >+# 'af': '', # Afrikaans > >+ 'ar': 'arabic', > >+ 'bg': 'bulgarian', > > >+ > >+ANALYZER_LIST = set(analyzer for analyzer in ANALYZER_MAP.values()) > > set(ANALYZER_MAP.values()) does the same thing. > > >diff --git a/apps/search/views.py b/apps/search/views.py > >index e7c5681..a01fffc 100644 > >--- a/apps/search/views.py > >+++ b/apps/search/views.py > >@@ -4,6 +4,7 @@ from django.conf import settings > > from django.db.models import Q > > from django.shortcuts import redirect > > from django.utils.encoding import smart_str > >+from django.utils import translation > > > > import commonware.log > > import jingo > >@@ -11,6 +12,7 @@ from tower import ugettext as _ > > from mobility.decorators import mobile_template > > > > import amo > >+from search import ANALYZER_MAP > > import bandwagon.views > > import browse.views > > from addons.models import Addon, Category > >@@ -374,21 +376,43 @@ def ajax_search_suggestions(request): > > > > return results > > > >+def _get_analyzer(): > >+ language = translation.get_language() > >+ if language in ANALYZER_MAP.keys(): > >+ return ANALYZER_MAP[language] > >+ return 'standard' > > This can be written as `ANALYZER_MAP.get(language, 'standard')`. Thanks. I will fix it. I am actually very unfamiliar with Python.;-) > > def name_only_query(q): > >- return dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'}, > >- name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4}, > >- name__startswith={'value': q, 'boost': 1.5}) > >+ d = dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'}, > >+ name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4}, > >+ name__startswith={'value': q, 'boost': 1.5}) > >+ > >+ analyzer = _get_analyzer() > >+ d['name_' + analyzer + '__text'] = {'query': q, 'boost': 2.5, 'analyzer': analyzer} > >+ return d > > If this returns "standard" we shouldn't be adding another the extra > (redundant) query. Ah, yes exactly. I will fix it. > >+ analyzer = _get_analyzer() > >+ more['summary_' + analyzer + '__text'] = {'query': q, > >+ 'boost': 0.6, > >+ 'type': 'phase', > > Should this be "phrase"? Thanks. My mistakes. > How can we test that this works and that we don't regress it? You don't have > to write the tests, but some examples that we can turn into test cases would > be great. I actually wanted to write some tests, but I can find where the tests should be in.

Hiroyuki Ikezoe (:hiro)

Assignee

Comment 7

•

14 years ago

> > How can we test that this works and that we don't regress it? You don't have > > to write the tests, but some examples that we can turn into test cases would > > be great. > > I actually wanted to write some tests, but I can find where the tests should > be in. I meant I can *not* find, of course.

Hiroyuki Ikezoe (:hiro)

Assignee

Comment 8

•

14 years ago

Attached patch Revised patch — Details — Splinter Review

Addressed jbalogh's comment and created two mappings in apps/search/__init__.py. I am sorry that this patch does not include any tests.

Attachment #568583 - Attachment is obsolete: true

Attachment #568583 - Flags: review?(jbalogh)

Attachment #569001 - Flags: review?(jbalogh)

Jeff Balogh (:jbalogh)

Comment 9

•

14 years ago

Thanks Hiro. This patch plus some extra stuff from me was merged in https://github.com/jbalogh/zamboni/commit/82779b1, which included https://github.com/mozilla/zamboni/commit/0d22216 and https://github.com/mozilla/zamboni/commit/062e8a61f. I'm reindexing all the add-ons right now. Can you see if the changes improve search quality? URLs comparing -dev (https://addons-dev.allizom.org/en-US/firefox/search/) and production (https://addons.mozilla.org/en-US/firefox/search/) would be great.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Hiroyuki Ikezoe (:hiro)

Assignee

Comment 10

•

14 years ago

Thank you, Jeff! (In reply to Jeff Balogh (:jbalogh) from comment #9) > Thanks Hiro. This patch plus some extra stuff from me was merged in > https://github.com/jbalogh/zamboni/commit/82779b1, which included > https://github.com/mozilla/zamboni/commit/0d22216 and > https://github.com/mozilla/zamboni/commit/062e8a61f. > > I'm reindexing all the add-ons right now. Can you see if the changes improve > search quality? URLs comparing -dev > (https://addons-dev.allizom.org/en-US/firefox/search/) and production > (https://addons.mozilla.org/en-US/firefox/search/) would be great. I checked both sites and the search results seem better on the name field. But the results of searching in the summary field seem wrong.. I found the following mistake. ;-) "__test" must be "__text" at the line of https://github.com/jbalogh/zamboni/commit/82779b1#L2R437 Anyway, thank you for your quick work!

Jeff Balogh (:jbalogh)

Comment 11

•

14 years ago

https://github.com/jbalogh/zamboni/commit/573d487

Jeff Balogh (:jbalogh)

Updated

•

14 years ago

Attachment #569001 - Flags: review?(jbalogh)

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: addons.mozilla.org → addons.mozilla.org Graveyard

Proposed fix 14 years ago Hiroyuki Ikezoe (:hiro) 6.67 KB, patch		Details \| Diff \| Splinter Review
Revised patch 14 years ago Hiroyuki Ikezoe (:hiro) 7.80 KB, patch		Details \| Diff \| Splinter Review

Bugzilla

Improve search quality for queries including non-ascii characters

Categories

(addons.mozilla.org Graveyard :: Search, defect)

Tracking

(Not tracked)

People

(Reporter: kohei, Assigned: hiro)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file, 1 obsolete file)

Description

Updated

Comment 2

Comment 3

Comment 4

Updated

Comment 5

Updated

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated

Updated

Attachment

General

Description

File Name

Content Type