Last Comment Bug 232612 - enable boolean mode fulltext searches
: enable boolean mode fulltext searches
Status: RESOLVED FIXED
[requires MySQL 4.0.1 or higher]
:
Product: Bugzilla
Classification: Server Software
Component: Query/Bug List (show other bugs)
: unspecified
: All All
: -- enhancement (vote)
: Bugzilla 2.22
Assigned To: Myk Melez [:myk] [@mykmelez]
: default-qa
:
Mentors:
http://www.mysql.com/doc/en/Fulltext_...
Depends on: 204217
Blocks: 300548
  Show dependency treegraph
 
Reported: 2004-01-29 18:46 PST by Myk Melez [:myk] [@mykmelez]
Modified: 2012-12-18 20:46 PST (History)
1 user (show)
myk: approval+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
patch v1: implements feature (1.35 KB, patch)
2004-02-05 14:07 PST, Myk Melez [:myk] [@mykmelez]
no flags Details | Diff | Splinter Review
patch v2: fix for syntax error and indentation (1.35 KB, patch)
2004-02-05 14:16 PST, Myk Melez [:myk] [@mykmelez]
LpSolit: review+
Details | Diff | Splinter Review
patch v3: unrotted (2.13 KB, patch)
2005-07-11 16:44 PDT, Myk Melez [:myk] [@mykmelez]
LpSolit: review+
Details | Diff | Splinter Review
patch v4: does what Max suggests (4.65 KB, patch)
2005-07-12 13:56 PDT, Myk Melez [:myk] [@mykmelez]
no flags Details | Diff | Splinter Review
patch v5: no need to trick taint about @words (4.52 KB, patch)
2005-07-12 17:43 PDT, Myk Melez [:myk] [@mykmelez]
LpSolit: review+
Details | Diff | Splinter Review

Description Myk Melez [:myk] [@mykmelez] 2004-01-29 18:46:01 PST
Boolean mode fulltext searches provide advanced options for finding bugs and
should be turned on in Bugzilla once we require MySQL 4.0 (the first version of
MySQL to support them).  They are worse at determining relevance, however, so we
may not want to turn them on by default but rather have an option for turning
them on (either explicitly or by scanning the search string for boolean prefixes).
Comment 1 Myk Melez [:myk] [@mykmelez] 2004-02-05 14:07:30 PST
Created attachment 140693 [details] [diff] [review]
patch v1: implements feature
Comment 2 Myk Melez [:myk] [@mykmelez] 2004-02-05 14:16:24 PST
Created attachment 140694 [details] [diff] [review]
patch v2: fix for syntax error and indentation
Comment 3 David Lawrence [:dkl] 2004-07-02 10:15:27 PDT
Very MySQL specific. But we knew that already ;) I will try to implement this
somehow in PostgreSQL.
Comment 4 Dave Miller [:justdave] (justdave@bugzilla.org) 2004-09-18 17:58:54 PDT
Bugzilla 2.20 feature set is now frozen as of 15 Sept 2004.  Anything flagged
enhancement that hasn't already landed is being pushed out.  If this bug is
otherwise ready to land, we'll handle it on a case-by-case basis, please set the
blocking2.20 flag to '?' if you think it qualifies.
Comment 5 Myk Melez [:myk] [@mykmelez] 2004-12-13 18:53:08 PST
Comment on attachment 140694 [details] [diff] [review]
patch v2: fix for syntax error and indentation

Dave, can you review this simple, simple patch?
Comment 6 Dave Miller [:justdave] (justdave@bugzilla.org) 2005-02-15 09:01:51 PST
/me notes the dependencies on this bug.

Feature requires MySQL 4.0.1 or newer.

We have two options.
1) Start requiring MySQL 4.0.1 or newer (it's probably a bit premature to
require anything newer than 3.23.x right now because of distribution
difficulties, but that'll *very* likely change within the next year)
2) Check the server version, and only attempt to use this feature if the
installed server is new enough for it.
Comment 7 Frédéric Buclin 2005-02-15 15:00:11 PST
Comment on attachment 140694 [details] [diff] [review]
patch v2: fix for syntax error and indentation

Tested on my installation (with MySQL 4.0.20) and it works fine. So r=LpSolit.

But we have to increase the minimal version of MySQL to 4.0.1 before checking
in this patch!
Comment 8 Myk Melez [:myk] [@mykmelez] 2005-07-11 16:44:54 PDT
Created attachment 188992 [details] [diff] [review]
patch v3: unrotted

This patch unrots the previous one, since the way we handle boolean mode
searches has changed.
Comment 9 Max Kanat-Alexander 2005-07-11 19:45:10 PDT
Comment on attachment 188992 [details] [diff] [review]
patch v3: unrotted

>+            # Figure out if we're dealing with a boolean search.
>+            my $boolean_mode = ($v =~ /[+-<>()~*"]/);

  Why don't we just do that inside of Bugzilla::DB::Mysql's
sql_fulltext_search? Then we'll get it whenever we fulltext-search anything.
Comment 10 Frédéric Buclin 2005-07-11 20:13:34 PDT
Comment on attachment 188992 [details] [diff] [review]
patch v3: unrotted

$boolean_mode description has to be added into the POD docs of DB.pm.

Your patch seems to work as expected, based on the few tests I did. So
r=LpSolit assuming you will post a new patch with the $boolean_mode description
included. (carry forward my r+)
Comment 11 Myk Melez [:myk] [@mykmelez] 2005-07-12 13:56:22 PDT
Created attachment 189094 [details] [diff] [review]
patch v4: does what Max suggests

> Why don't we just do that inside of Bugzilla::DB::Mysql's
> sql_fulltext_search? Then we'll get it whenever we fulltext-search anything.

Good point.  Done.  Frederic, this means that the $boolean_mode parameter is no
more, so I haven't added it to the POD documentation.  It also means that patch
is sufficiently different from the way it used to be that it needs another
review.

In particular, the code now SQL quotes the text inside sql_fulltext_search()
rather than in Search.pm, since boolean mode needs to be determined from the
unquoted string (otherwise we'd think all searches were in boolean mode on a
database that quotes with double-quotes).

The new approach is also cleaner, since we're quoting closer to where the
quotes are actually needed, and simpler, since we no longer need to unquote the
string before using it in the ANSI version of the method.
Comment 12 Myk Melez [:myk] [@mykmelez] 2005-07-12 17:43:40 PDT
Created attachment 189127 [details] [diff] [review]
patch v5: no need to trick taint about @words 

We don't need to trick taint about @words, since we derive @words via a regexp,
which suffices to detaint them.
Comment 13 Frédéric Buclin 2005-07-14 18:34:41 PDT
Comment on attachment 189127 [details] [diff] [review]
patch v5: no need to trick taint about @words 

I did some searches using both the "Find a specific bug" and "Advanced Search".
Here are the typical SQL statements I get:


Find a specific bug (MySQL):

SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status,
bugs.resolution, map_assigned_to.login_name, map_reporter.login_name,
bugs.bug_status, bugs.resolution, bugs.short_desc,
(SUM(MATCH(longdescs_0.thetext) AGAINST('test >assignee' IN BOOLEAN
MODE))/COUNT(MATCH(longdescs_0.thetext) AGAINST('test >assignee' IN BOOLEAN
MODE)) + MATCH(bugs.short_desc) AGAINST('test >assignee' IN BOOLEAN MODE)) AS
relevance FROM bugs INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to
= map_assigned_to.userid) INNER JOIN profiles AS map_reporter ON (bugs.reporter
= map_reporter.userid) INNER JOIN longdescs AS longdescs_0 ON (bugs.bug_id =
longdescs_0.bug_id AND longdescs_0.isprivate < 1) LEFT JOIN bug_group_map ON
bug_group_map.bug_id = bugs.bug_id WHERE ((bugs.bug_status IN
('UNCONFIRMED','NEW','ASSIGNED','REOPENED'))) AND (((MATCH(longdescs_0.thetext)
AGAINST('test >assignee' IN BOOLEAN MODE) > 0) OR (bugs.short_desc REGEXP
'(^|[^a-z0-9])test($|[^a-z0-9])' AND bugs.short_desc REGEXP
'(^|[^a-z0-9])\\>assignee($|[^a-z0-9])'))) AND bugs.creation_ts IS NOT NULL AND
((bug_group_map.group_id IS NULL)) GROUP BY bugs.bug_id ORDER BY relevance
LIMIT 200



Find a specific bug (non-MySQL):

SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status,
bugs.resolution, map_assigned_to.login_name, map_reporter.login_name,
bugs.bug_status, bugs.resolution, bugs.short_desc, (SUM(CASE WHEN
(LOWER(longdescs_0.thetext) LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE
'%>assignee%') THEN 1 ELSE 0 END)/COUNT(CASE WHEN (LOWER(longdescs_0.thetext)
LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE '%>assignee%') THEN 1 ELSE 0
END) + CASE WHEN (LOWER(bugs.short_desc) LIKE '%test%' AND
LOWER(bugs.short_desc) LIKE '%>assignee%') THEN 1 ELSE 0 END) AS relevance FROM
bugs INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to =
map_assigned_to.userid) INNER JOIN profiles AS map_reporter ON (bugs.reporter =
map_reporter.userid) INNER JOIN longdescs AS longdescs_0 ON (bugs.bug_id =
longdescs_0.bug_id AND longdescs_0.isprivate < 1) LEFT JOIN bug_group_map ON
bug_group_map.bug_id = bugs.bug_id WHERE ((bugs.bug_status IN
('UNCONFIRMED','NEW','ASSIGNED','REOPENED'))) AND (((CASE WHEN
(LOWER(longdescs_0.thetext) LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE
'%>assignee%') THEN 1 ELSE 0 END > 0) OR (bugs.short_desc REGEXP
'(^|[^a-z0-9])test($|[^a-z0-9])' AND bugs.short_desc REGEXP
'(^|[^a-z0-9])\\>assignee($|[^a-z0-9])'))) AND bugs.creation_ts IS NOT NULL AND
((bug_group_map.group_id IS NULL)) GROUP BY bugs.bug_id ORDER BY relevance desc
LIMIT 200



Advanced search (MySQL):

SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status,
bugs.resolution, map_assigned_to.login_name, map_reporter.login_name,
bugs.bug_status, bugs.resolution, bugs.short_desc,
(SUM(MATCH(longdescs_0.thetext) AGAINST('test >assignee' IN BOOLEAN
MODE))/COUNT(MATCH(longdescs_0.thetext) AGAINST('test >assignee' IN BOOLEAN
MODE)) + MATCH(bugs.short_desc) AGAINST('test >assignee' IN BOOLEAN MODE)) AS
relevance FROM bugs INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to
= map_assigned_to.userid) INNER JOIN profiles AS map_reporter ON (bugs.reporter
= map_reporter.userid) INNER JOIN longdescs AS longdescs_0 ON (bugs.bug_id =
longdescs_0.bug_id AND longdescs_0.isprivate < 1) LEFT JOIN bug_group_map ON
bug_group_map.bug_id = bugs.bug_id WHERE ((MATCH(longdescs_0.thetext)
AGAINST('test >assignee' IN BOOLEAN MODE) > 0)) AND bugs.creation_ts IS NOT
NULL AND ((bug_group_map.group_id IS NULL)) GROUP BY bugs.bug_id ORDER BY
bugs.bug_id LIMIT 200



Advanced search (non-MySQL):

SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status,
bugs.resolution, map_assigned_to.login_name, map_reporter.login_name,
bugs.bug_status, bugs.resolution, bugs.short_desc, (SUM(CASE WHEN
(LOWER(longdescs_0.thetext) LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE
'%>assignee%') THEN 1 ELSE 0 END)/COUNT(CASE WHEN (LOWER(longdescs_0.thetext)
LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE '%>assignee%') THEN 1 ELSE 0
END) + CASE WHEN (LOWER(bugs.short_desc) LIKE '%test%' AND
LOWER(bugs.short_desc) LIKE '%>assignee%') THEN 1 ELSE 0 END) AS relevance FROM
bugs INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to =
map_assigned_to.userid) INNER JOIN profiles AS map_reporter ON (bugs.reporter =
map_reporter.userid) INNER JOIN longdescs AS longdescs_0 ON (bugs.bug_id =
longdescs_0.bug_id AND longdescs_0.isprivate < 1) LEFT JOIN bug_group_map ON
bug_group_map.bug_id = bugs.bug_id WHERE ((CASE WHEN
(LOWER(longdescs_0.thetext) LIKE '%test%' AND LOWER(longdescs_0.thetext) LIKE
'%>assignee%') THEN 1 ELSE 0 END > 0)) AND bugs.creation_ts IS NOT NULL AND
((bug_group_map.group_id IS NULL)) GROUP BY bugs.bug_id ORDER BY bugs.bug_id
LIMIT 200


Some remarks, constatations:

1) The WHERE clause:

- The WHERE clause does not look at the bug summary in the Advanced Search,
meaning that if the words we are looking for are in the bug summary only, we
will miss the bug.

What would prevent us from doing something like:
WHERE (MATCH(longdescs_0.thetext, bugs.short_desc) AGAINST('test >assignee' IN
BOOLEAN MODE) > 0)  ?

If that's not a performance issue, it would make sense to open a bug to
implement this feature.

- Why does "Find a specific bug" use LIKE for comments in the WHERE clause, but
REGEXP for the bug summary? Couldn't we use LIKE too? Maybe would LIKE be
faster than this REGEXP? If yes, open a bug.


2) The ORDER BY clause:

- "Find a specific bug" orders the results by relevance. Advanced Search
doesn't. That's a bug IMO. Especially because buglist.cgi displays "Bugs on
this list are sorted by relevance, with the most relevant bugs at the top.".
But we know that's not true, except when using "Find a specific bug".


But AFAIK, all these points are not related to your patch. So r=LpSolit
Comment 14 Myk Melez [:myk] [@mykmelez] 2005-07-14 19:40:22 PDT
> - The WHERE clause does not look at the bug summary in the Advanced Search,
> meaning that if the words we are looking for are in the bug summary only, we
> will miss the bug.

Right.  That's a known limitation of Bugzilla which is actually fixed in the
Bugzilla tip.  Unfortunately, that fix regresses performance significantly for
large installations, so we've backed out that patch for b.m.o.


> What would prevent us from doing something like:
> WHERE (MATCH(longdescs_0.thetext, bugs.short_desc) AGAINST('test >assignee' IN
> BOOLEAN MODE) > 0)  ?

Well, first of all, you can only list multiple columns from the same table in
the MATCH function (they have to exactly match a set of columns on which a
fulltext index has been created, and you can't create cross-table indexes), so
this would return an error.

You could instead do two separate MATCHes, but then MySQL won't use the fulltext
indexes, and it will be extremely slow (so slow, in fact, that queries will time
out on b.m.o and probably many other Bugzilla installations).

At one point, I knew why this was the case, but I don't quite remember anymore.
 Bug 145588, comment 35 says something about it.  It may have something to do
with MySQL's one index per table limitation, or it may be something about doing
a union of summary and comment matches (i.e. summary matches *OR* comments match).


> - Why does "Find a specific bug" use LIKE for comments in the WHERE clause, 
> but REGEXP for the bug summary? Couldn't we use LIKE too? Maybe would LIKE
> be faster than this REGEXP? If yes, open a bug.

I guess you are referring to the ANSI version.  Presumably, the ANSI version is
not intended to be used unless absolutely necessary, so great care has not been
taken with making it work as well as possible.  I imagine that LIKE would be
faster but REGEXP would be more accurate (since it's possible to specify word
boundaries in it, and fulltext searches are concerned with words, not substring
matches).


> - "Find a specific bug" orders the results by relevance. Advanced Search
> doesn't. That's a bug IMO. Especially because buglist.cgi displays "Bugs on
> this list are sorted by relevance, with the most relevant bugs at the top.".
> But we know that's not true, except when using "Find a specific bug".

Yes.  It's tough, since the advanced search has a sort field that always has
some value selected, but I imagine that most users searching for content want
their results sorted by relevance, so it would make sense to override that value
(but we should find a way to distinguish between users who didn't explicitly set
an order and those who did, perhaps by changing the default sort order to
something like "default").


Thanks for the review!  Checking in...

Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.102; previous revision: 1.101
done
Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.57; previous revision: 1.56
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.26; previous revision: 1.25
done

Note You need to log in before you can comment on or make changes to this bug.