Closed Bug 635837 Opened 13 years ago Closed 13 years ago

Improve search quality for queries including non-ascii characters

Categories

(addons.mozilla.org Graveyard :: Search, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Q4 2011

People

(Reporter: kohei, Assigned: hiro)

References

Details

Attachments

(1 file, 1 obsolete file)

The current multibyte support of the AMO search engine is limited. I've heard it uses Trigram; it works for en-US, but is not suited for some languages.

Example 1: search "マーカー" (marker) returns 2 results

https://addons.mozilla.org/ja/firefox/search/?q=%E3%83%9E%E3%83%BC%E3%82%AB%E3%83%BC

but actual results (manually collected) are 5:

https://addons.mozilla.org/ja/firefox/addon/line-marker/
https://addons.mozilla.org/ja/firefox/addon/wired-marker/
https://addons.mozilla.org/ja/firefox/addon/hyper-anchor/
https://addons.mozilla.org/ja/firefox/addon/xulmigemo/
https://addons.mozilla.org/ja/firefox/addon/scrapbook/

Example 2: search tab (タブ) returns 15 results:
https://addons.mozilla.org/ja/firefox/search/?q=%E3%82%BF%E3%83%96

but there are more add-ons with localized summary/description including タブ. 

(Lots of localized add-ons don't have localized summary/description but it doesn't matter here. This bug intends to improve the search quality.)

I've made a Google custom engine for AMO/ja, so you can quickly test the quality:

http://www.google.com/cse/home?cx=004119830362093040131%3A0hn1r0uqptc
Summary: Search: support multibyte characters → Improve search quality for queries including non-ascii characters
Bug 669383 has some explanation of why this is happening, but I'm marking it as a dupe here since it was later.
Assignee: nobody → jbalogh
Target Milestone: --- → Q4 2011
I've investigated in further detail, and found some issues in zamboni and elasticsearch.

In zamboni,
 - snowball and standardPlusWordDelimiter are used for index analyzer, but those analyzers should be language analyzer[1] depends on localized text.
 - standard analyzer is used for query string, but it also should be language analyzer depends on input text.

[1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html

In elastic search,
 - there is a bug 'cjk' language analyzer. Though I've sent a pull-request to upstream[2] for this issue, I'd recommend apply the fix to elasticsearch on AMO sever if the elasticsearch on the server is built by Mozilla.

[2] https://github.com/elasticsearch/elasticsearch/pull/1415
Attached patch Proposed fix (obsolete) — Splinter Review
Creating indices for each language and use language specific analyzer for input text based on locale (It is hard to detect the language depends on input text itself).
Attachment #568583 - Flags: review?(jbalogh)
Comment on attachment 568583 [details] [diff] [review]
Proposed fix

First of all, thanks for taking a look at this! Can you provide some examples where the current analyzers are failing? The analyzer is biased towards english, but I don't think the stemming is very important since most people search for addons by name (according to the logs). It would help to have a good collection of examples when people go to change how the search engine works.

A couple nits to get out of the way: please follow pep8 (particularly, keeping lines under 80 characters) and use more descriptive variable names. You're using _ as a variable name in the following block and it's hard to tell what that's actually referring to.

>+
>+    # Indices for each language
>+    for analyzer in ANALYZER_LIST:
>+        languages = [l for l, a in ANALYZER_MAP.iteritems() if a == analyzer]

It looks like you actually want ANALYZER_MAP to be a mapping from language => [language codes], so you'd have (for example) {'cjk': ['ko', 'ja']}.

>+        d['name_' + analyzer] = list(set(string for _, string
>+                                         in translations[addon.name_id] if _.lower() in languages))

Can you use "locale" instead of "_" as the variable name?

>+    for analyzer in ANALYZER_LIST:
>+        mapping['properties']['name_' + analyzer] = {
>+            'type': 'string',
>+            'analyzer': analyzer,
>+        }

I don't see these analyzers in http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Where do they come from?

>--- a/apps/search/__init__.py
>+++ b/apps/search/__init__.py
>@@ -0,0 +1,40 @@

Can you add comments on where the keys and values come from and how it's supposed to be used? And why are some rows commented out?

>+ANALYZER_MAP = {
>+#    'af': '', # Afrikaans
>+    'ar': 'arabic',
>+    'bg': 'bulgarian',

>+
>+ANALYZER_LIST = set(analyzer for analyzer in ANALYZER_MAP.values())

set(ANALYZER_MAP.values()) does the same thing. 

>diff --git a/apps/search/views.py b/apps/search/views.py
>index e7c5681..a01fffc 100644
>--- a/apps/search/views.py
>+++ b/apps/search/views.py
>@@ -4,6 +4,7 @@ from django.conf import settings
> from django.db.models import Q
> from django.shortcuts import redirect
> from django.utils.encoding import smart_str
>+from django.utils import translation
> 
> import commonware.log
> import jingo
>@@ -11,6 +12,7 @@ from tower import ugettext as _
> from mobility.decorators import mobile_template
> 
> import amo
>+from search import ANALYZER_MAP
> import bandwagon.views
> import browse.views
> from addons.models import Addon, Category
>@@ -374,21 +376,43 @@ def ajax_search_suggestions(request):
> 
>     return results
> 
>+def _get_analyzer():
>+    language = translation.get_language()
>+    if language in ANALYZER_MAP.keys():
>+        return ANALYZER_MAP[language]
>+    return 'standard'

This can be written as `ANALYZER_MAP.get(language, 'standard')`.

> 
> def name_only_query(q):
>-    return dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'},
>-                name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4},
>-                name__startswith={'value': q, 'boost': 1.5})
>+    d = dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'},
>+             name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4},
>+             name__startswith={'value': q, 'boost': 1.5})
>+
>+    analyzer = _get_analyzer()
>+    d['name_' + analyzer + '__text'] = {'query': q, 'boost': 2.5, 'analyzer': analyzer}
>+    return d

If this returns "standard" we shouldn't be adding another the extra (redundant) query.

>+
>+    analyzer = _get_analyzer()
>+    more['summary_' + analyzer + '__text'] = {'query': q,
>+                                              'boost': 0.6,
>+                                              'type': 'phase',

Should this be "phrase"?

>1.7.4.1
>

How can we test that this works and that we don't regress it? You don't have to write the tests, but some examples that we can turn into test cases would be great.
Assignee: jbalogh → hiikezoe
(In reply to Jeff Balogh (:jbalogh) from comment #5)
> Comment on attachment 568583 [details] [diff] [review] [diff] [details] [review]
> Proposed fix
> 
> First of all, thanks for taking a look at this! Can you provide some
> examples where the current analyzers are failing? The analyzer is biased
> towards english, but I don't think the stemming is very important since most
> people search for addons by name (according to the logs). It would help to
> have a good collection of examples when people go to change how the search
> engine works.


A typical failure case is collocation strings with KATAKANA words.
For example, "テキストリンク" (it means "text link" in English) consists of 
"テキスト" and "リンク", but the current analyzer stores these strings as a word in index tables.
So "テキストリンク" can be searched but "リンク" can not be searched.  Stemming is not a problem at least in the cases of CJK here.


> A couple nits to get out of the way: please follow pep8 (particularly,
> keeping lines under 80 characters) and use more descriptive variable names.
> You're using _ as a variable name in the following block and it's hard to
> tell what that's actually referring to.

Yes, you are right. I actually love descriptive name.

> >+
> >+    # Indices for each language
> >+    for analyzer in ANALYZER_LIST:
> >+        languages = [l for l, a in ANALYZER_MAP.iteritems() if a == analyzer]
> 
> It looks like you actually want ANALYZER_MAP to be a mapping from language
> => [language codes], so you'd have (for example) {'cjk': ['ko', 'ja']}.

We need two mappings.  One is the current ANALYZER_MAP used in apps/search/view.py.
The other is exactly the mapping you are mentioning.

You mean the other mapping should also be created in apps/search/__init__py? I do not know how to create the mapping smart in Python. 

> >+        d['name_' + analyzer] = list(set(string for _, string
> >+                                         in translations[addon.name_id] if _.lower() in languages))
> 
> Can you use "locale" instead of "_" as the variable name?

Yes, sure.

> >+    for analyzer in ANALYZER_LIST:
> >+        mapping['properties']['name_' + analyzer] = {
> >+            'type': 'string',
> >+            'analyzer': analyzer,
> >+        }
> 
> I don't see these analyzers in
> http://www.elasticsearch.org/guide/reference/index-modules/analysis/. Where
> do they come from?

Yes, the document is little bit confusable. I had also misunderstood the language analyzer first. 'language' is actually not a valid analyzer name, each language analyzer name (listed in [1]) has to be set.

[1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html

> >--- a/apps/search/__init__.py
> >+++ b/apps/search/__init__.py
> >@@ -0,0 +1,40 @@
> 
> Can you add comments on where the keys and values come from and how it's
> supposed to be used? And why are some rows commented out?

Yes, I will.

> >+ANALYZER_MAP = {
> >+#    'af': '', # Afrikaans
> >+    'ar': 'arabic',
> >+    'bg': 'bulgarian',
> 
> >+
> >+ANALYZER_LIST = set(analyzer for analyzer in ANALYZER_MAP.values())
> 
> set(ANALYZER_MAP.values()) does the same thing. 
> 
> >diff --git a/apps/search/views.py b/apps/search/views.py
> >index e7c5681..a01fffc 100644
> >--- a/apps/search/views.py
> >+++ b/apps/search/views.py
> >@@ -4,6 +4,7 @@ from django.conf import settings
> > from django.db.models import Q
> > from django.shortcuts import redirect
> > from django.utils.encoding import smart_str
> >+from django.utils import translation
> > 
> > import commonware.log
> > import jingo
> >@@ -11,6 +12,7 @@ from tower import ugettext as _
> > from mobility.decorators import mobile_template
> > 
> > import amo
> >+from search import ANALYZER_MAP
> > import bandwagon.views
> > import browse.views
> > from addons.models import Addon, Category
> >@@ -374,21 +376,43 @@ def ajax_search_suggestions(request):
> > 
> >     return results
> > 
> >+def _get_analyzer():
> >+    language = translation.get_language()
> >+    if language in ANALYZER_MAP.keys():
> >+        return ANALYZER_MAP[language]
> >+    return 'standard'
> 
> This can be written as `ANALYZER_MAP.get(language, 'standard')`.

Thanks. I will fix it. I am actually very unfamiliar with Python.;-)

> > def name_only_query(q):
> >-    return dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'},
> >-                name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4},
> >-                name__startswith={'value': q, 'boost': 1.5})
> >+    d = dict(name__text={'query': q, 'boost': 3, 'analyzer': 'standard'},
> >+             name__fuzzy={'value': q, 'boost': 2, 'prefix_length': 4},
> >+             name__startswith={'value': q, 'boost': 1.5})
> >+
> >+    analyzer = _get_analyzer()
> >+    d['name_' + analyzer + '__text'] = {'query': q, 'boost': 2.5, 'analyzer': analyzer}
> >+    return d
> 
> If this returns "standard" we shouldn't be adding another the extra
> (redundant) query.

Ah, yes exactly. I will fix it.

> >+    analyzer = _get_analyzer()
> >+    more['summary_' + analyzer + '__text'] = {'query': q,
> >+                                              'boost': 0.6,
> >+                                              'type': 'phase',
> 
> Should this be "phrase"?

Thanks. My mistakes.

> How can we test that this works and that we don't regress it? You don't have
> to write the tests, but some examples that we can turn into test cases would
> be great.

I actually wanted to write some tests, but I can find where the tests should be in.
> > How can we test that this works and that we don't regress it? You don't have
> > to write the tests, but some examples that we can turn into test cases would
> > be great.
> 
> I actually wanted to write some tests, but I can find where the tests should
> be in.

I meant I can *not* find, of course.
Attached patch Revised patchSplinter Review
Addressed jbalogh's comment and created two mappings in apps/search/__init__.py.

I am sorry that this patch does not include any tests.
Attachment #568583 - Attachment is obsolete: true
Attachment #568583 - Flags: review?(jbalogh)
Attachment #569001 - Flags: review?(jbalogh)
Thanks Hiro. This patch plus some extra stuff from me was merged in https://github.com/jbalogh/zamboni/commit/82779b1, which included https://github.com/mozilla/zamboni/commit/0d22216 and https://github.com/mozilla/zamboni/commit/062e8a61f.

I'm reindexing all the add-ons right now. Can you see if the changes improve search quality? URLs comparing -dev (https://addons-dev.allizom.org/en-US/firefox/search/) and production (https://addons.mozilla.org/en-US/firefox/search/) would be great.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Thank you, Jeff!

(In reply to Jeff Balogh (:jbalogh) from comment #9)
> Thanks Hiro. This patch plus some extra stuff from me was merged in
> https://github.com/jbalogh/zamboni/commit/82779b1, which included
> https://github.com/mozilla/zamboni/commit/0d22216 and
> https://github.com/mozilla/zamboni/commit/062e8a61f.
> 
> I'm reindexing all the add-ons right now. Can you see if the changes improve
> search quality? URLs comparing -dev
> (https://addons-dev.allizom.org/en-US/firefox/search/) and production
> (https://addons.mozilla.org/en-US/firefox/search/) would be great.

I checked both sites and the search results seem better on the name field. But the results of searching in the summary field seem wrong..

I found the following mistake. ;-)

"__test" must be "__text"  at the line of https://github.com/jbalogh/zamboni/commit/82779b1#L2R437

Anyway, thank you for your quick work!
Attachment #569001 - Flags: review?(jbalogh)
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: