Open Bug 365258 Opened 18 years ago Updated 9 years ago

[PostgreSQL] Support builtin fulltext search capabilities

Categories

(Bugzilla :: Database, enhancement)

2.23.3
enhancement
Not set
normal

Tracking

()

People

(Reporter: bugzilla-mozilla, Unassigned)

References

()

Details

Attachments

(1 file, 5 obsolete files)

As of Postgres 8.2, tsearch2 supports UTF-8. So I guess that needs to be the minimum version for fulltext indexing. Before 8.3, a user manually has to do execute a tsearch2.sql. This will add the necessary 'fulltext' functions to the database.
From what I've read (see URL), as of 8.3 (development only just started) the tsearch2 module is loaded by default into the database. This would make supporting 

I'm not sure how to handle the software requirement. Probably the fulltext should be optional and to enable it you'd need 8.2? Requiring 8.3 would be even better.. but ehr.. no idea when that will be out.
If the requirement starts with 8.2, then I'm not sure how to handle the loading of the tsearch2.sql. It would be best if checksetup.pl would do it for the user. However there are possible license problems and tsearch2.sql might be Pg version specific.
Severity: normal → enhancement
Wow, I can't believe this wasn't already filed. I was talking about this with Tomas way back when we first wrote Pg support. But anyhow, yes we should do this.

We can just check somehow if tsearch is installed and switch the fulltext method used in the sql_fulltext method (or whatever it's called).
(Patch is far from ready. Just some notes to have a more user-friendly explanation than the tsearch2 docs)

To detect tsearch2 I do the following:

$self->{private_have_tsearch2} = $self->selectrow_array(
        "SELECT 1 FROM information_schema.routines
        WHERE routine_name = 'plainto_tsquery' LIMIT 1");

However, I need to pass that information somehow to Bugzilla/DB/Schema/Pg.pm.


To make tsearch2 work:

psql -U bzcvspg < /usr/share/pgsql/contrib/tsearch2.sql

ALTER TABLE longdescs ADD COLUMN vectors tsvector;
CREATE INDEX longdescs_thetext_fulltext ON longdescs USING gist(vectors);
UPDATE longdescs SET vectors = to_tsvector(thetext);
CREATE TRIGGER update_thetext_tsidx BEFORE UPDATE OR INSERT ON longdescs FOR EACH ROE EXECUTE PROCEDURE  tsearch2(vectors, thetext);


To search:

SELECT bug_id, thetext FROM longdescs WHERE vectors @@ plainto_tsquery('preferences');


Sometimes whines about UTF-8 environment (this probably is a hack):

SELECT set_curcfg('default');
Assignee: database → bugzilla-mozilla
Attached patch Work in progress (obsolete) — Splinter Review
Work in progress. Bugs:
 * Don't like the way I added the trigger (ideally it should be added whenever Bugzilla adds a vector column)
 * Detection of tsearch2 could be better (e.g., broken tsearch2 config)
 * Only works when running from a non-existing db (will NOT add FTI to an existing one)
 * Still has debug code
 * Might have broken MySQL
 * Broken when running from an existing db with tsearch2 support, but without the right FTI columns (only setup when table is created)

It does search though ;)

Oh, instructions (DATA LOSS!):
Run psql template1 $USER, execute the following
  DROP DATABASE $DATABASE
  CREATE DATABASE $DATABASE
psql $DATABASE $USER < /usr/share/pgsql/contrib/tsearch2.sql
Attached patch Alpha 1 (obsolete) — Splinter Review
This version:
 * Should not break MySQL
 * Detects if tsearch2 is available, and will add the needed columns/indexes
 * content search should work with and without tsearch2
 * I think there is still a bug in there when upgrading existing Pg databases (ones created without this patch)... should still work.. I think it won't add the GIN index

I'm mainly looking for feedback on how this is coded, and if it is ok or not.

The main problem is the changing of the {abstract,}schema. I will not know if it is needed until we have a db connection. However, the bz_schema object never deals with db connections! So I don't know if my solution is 'proper'.

Oh, btw. FTI on Pg (when you want any kind of speed):
 * new column that contains 'vectors'.. basically some info regarding words in thetext
 * a trigger to update this column whenever the columns it is based upon are changed
 * an gin or gist index for this column

You always search using the 'vector' column. Automatically uses the index when available. It would be possible to do a FTI without this column. However, that would be *very* slow!!!

Simple difference between GIN and GIST:
  GIN = fast searches, slow updates
  GIST = fast updates, slow searches

(see docs for better explanation)
Attachment #278287 - Attachment is obsolete: true
Attachment #278332 - Flags: review?(mkanat)
Comment on attachment 278332 [details] [diff] [review]
Alpha 1

Okay, I didn't look over this in *great* detail, but instead of putting that code inside of Schema::new, it should be inside of DB::Pg::bz_setup_database.

The index name should probably also contain the table name.

Add ->bz_supports_fulltext to Bugzilla::DB as a way of checking if FULLTEXT indexes should be added and used. For MySQL it can always return 1, and for Pg it can check if tsearch2 is enabled in the DB.

Definitely don't put Pg-specific do() code into Bugzilla::Install::DB. bz_add_index should handle that itself. (Meaning that _get_create_index_ddl should handle it).

It's OK to assume that once a DB supports tsearch2, it supports it forever. That is, you don't have to support ever migrating away from tsearch2.
Attachment #278332 - Flags: review?(mkanat) → review-
Status: NEW → ASSIGNED
Attached patch Work in progress (obsolete) — Splinter Review
Attachment #278332 - Attachment is obsolete: true
Some questions before I can hack it further. I've attached the current patch just as a FYI (not every comment has been addressed yet)

(In reply to comment #5)
> (From update of attachment 278332 [details] [diff] [review])
> Okay, I didn't look over this in *great* detail, but instead of putting that
> code inside of Schema::new, it should be inside of DB::Pg::bz_setup_database.

I made it more generic and placed most in Bugzilla::DB.

> The index name should probably also contain the table name.

Ehr? I did so.
Or do you mean the vector column? That is already within a table. Same for the trigger (table specific).

> Add ->bz_supports_fulltext to Bugzilla::DB as a way of checking if FULLTEXT
> indexes should be added and used. For MySQL it can always return 1, and for Pg
> it can check if tsearch2 is enabled in the DB.

done

> Definitely don't put Pg-specific do() code into Bugzilla::Install::DB.
> bz_add_index should handle that itself. (Meaning that _get_create_index_ddl
> should handle it).

The do code is a trigger to ensure the vector column is always up to date.

Should I do the vector column creation also within the _get_create_index_ddl? If so, I'll need to allow it to return multiple SQLs. Further, this means the vector will not be known by the schema (as it won't be created via a nice bz_add_column).

> It's OK to assume that once a DB supports tsearch2, it supports it forever.
> That is, you don't have to support ever migrating away from tsearch2.

ok.
Attached patch Logic in _get_create_index_ddl (obsolete) — Splinter Review
Untested

This dumps most of the logic inside _get_create_index_ddl. This means the schema won't know about the vector column, but perhaps that is good?

Oh, tsearch2 docs suggest a vacuum. However, I think that should be a generic operation (e.g. Bugzilla should do that always when adding indexes).
(In reply to comment #7)
> > The index name should probably also contain the table name.
> 
> Ehr? I did so.
> Or do you mean the vector column? That is already within a table. Same for the
> trigger (table specific).

  Oh, sorry. :-) I must have misunderstood that.

> Should I do the vector column creation also within the _get_create_index_ddl?
> If so, I'll need to allow it to return multiple SQLs. Further, this means the
> vector will not be known by the schema (as it won't be created via a nice
> bz_add_column).

  Yes, that's fine. The schema should assume that wherever there's a FULLTEXT index, there's a vector, so it doesn't have to know about the vector specifically as long as it knows the index is there.

  _get_add_index_ddl (which is what bz_add_index calls) is designed to return multiple SQL calls anyhow.
Attached patch Patch v3 (obsolete) — Splinter Review
Changes vs previous review:
 * Added some perldoc
 * _get_create_index_ddl has all the logic (vector column, trigger, ensuring vector is initialized with thetext data, index itself)
 * Bugzilla::DB uses bz_supports_fulltext to determine if FULLTEXT indexes should be removed from the schema (that code is hairy)
 * Changes _get_create_index_ddl to return multiple SQL statements (and changes all functions that assumed just one)

And of course:
 * Works with and without tsearch2
 * Doesn't break MySQL (but be aware that to test MySQL FULLTEXT support you need at least 3 comments)
Attachment #278429 - Attachment is obsolete: true
Attachment #278436 - Attachment is obsolete: true
Attachment #278459 - Flags: review?(mkanat)
Attached patch Patch v3.1Splinter Review
Changes:
 - realised that my fixes to Pg actually broke the MySQL ordering (0/1 instead of the ranking score produced by MySQL).

Unlike MySQL, Pg doesn't like SUM(boolean) or WHERE (1).  So after I fixed that I did not realise that this impacted the ranking of MySQL.
Attachment #278459 - Attachment is obsolete: true
Attachment #278593 - Flags: review?(mkanat)
Attachment #278459 - Flags: review?(mkanat)
Comment on attachment 278593 [details] [diff] [review]
Patch v3.1

>@@ -406,6 +419,39 @@ sub bz_get_field_defs {
> sub bz_setup_database {
> [snip]
>+
>+    # Remove FULLTEXT index types from the schemas if the database does not 
>+    # support it
>+    if (!$self->bz_supports_fulltext()) {

  How is it correct to move this code from Schema::Pg::_intialize to Bugzilla::DB::bz_setup_database?


>Index: Bugzilla/DB/Pg.pm
>@@ -70,6 +71,9 @@ sub new {
> 
>     my $self = $class->db_new($dsn, $user, $pass);
> 
>+    $self->selectrow_array("select set_curcfg('default');")
>+      if $self->bz_supports_fulltext;

  That should be "do", right? 

  What's that do, by the way? Maybe add a comment above it.

>+sub sql_fulltext_search {
>+    my $self = shift @_;

  Nit: You don't need the @_ here. It might confuse people, too, since we don't do it anywhere else in Bugzilla.

>Index: Bugzilla/DB/Schema.pm

  Why are these changes necessary? We can add FULLTEXT indexes just fine to MySQL without this...

>Index: Bugzilla/DB/Schema/Mysql.pm
>@@ -143,14 +143,16 @@ sub _get_create_index_ddl {

  The changes here aren't necessary.

>@@ -81,6 +63,8 @@ sub _initialize {
>+        VECTOR   =>     'tsvector',

  Is there some reason to add that? You could just specify tsvector directly in the TYPE arguments--that works fine.

>@@ -163,4 +147,41 @@ sub _get_alter_type_sql {
>+    # Ensure this column always reflects the index fields
>+    my $sql_trigger = "CREATE TRIGGER fti_$vector_column BEFORE UPDATE OR INSERT

  That shouldn't be AFTER UPDATE OR INSERT?


  I'm also concerned about the upgrade path from this to PostgreSQL 8.3, as I said on IRC. Until 8.3 comes out of beta, though, we don't have to support it at all. Just promise me that this *can* somehow be upgraded and that we can detect the 8.3 changes and so forth.
Attachment #278593 - Flags: review?(mkanat) → review-
(In reply to comment #12)
> (From update of attachment 278593 [details] [diff] [review])
> >@@ -406,6 +419,39 @@ sub bz_get_field_defs {
> > sub bz_setup_database {
> > [snip]
> >+
> >+    # Remove FULLTEXT index types from the schemas if the database does not 
> >+    # support it
> >+    if (!$self->bz_supports_fulltext()) {
> 
>   How is it correct to move this code from Schema::Pg::_intialize to
> Bugzilla::DB::bz_setup_database?

Because the tsearch2 support in Postgresql is optional (in 8.2.. it is builtin in 8.3, but I'd still need a db connection to detect that) and I wouldn't know if Pg supported it until the db connection was made. The Schema stuff doesn't have access to the db connection at that point.

> >Index: Bugzilla/DB/Pg.pm
> >@@ -70,6 +71,9 @@ sub new {
> > 
> >     my $self = $class->db_new($dsn, $user, $pass);
> > 
> >+    $self->selectrow_array("select set_curcfg('default');")
> >+      if $self->bz_supports_fulltext;
> 
>   That should be "do", right? 
> 
>   What's that do, by the way? Maybe add a comment above it.

Forgot... selects some tsearch2 config. If you don't do this, it causes weird bugs. It is a select, so I'm using a select thingy to avoid any possible 'finish' whining.

> >+sub sql_fulltext_search {
> >+    my $self = shift @_;
> 
>   Nit: You don't need the @_ here. It might confuse people, too, since we don't
> do it anywhere else in Bugzilla.

ok

> >Index: Bugzilla/DB/Schema.pm
> 
>   Why are these changes necessary? We can add FULLTEXT indexes just fine to
> MySQL without this...

This is about Postgresql, not MySQL. It needs multiple SQL statements to setup the FTI (3, one to add a tsvector column, another to create a trigger, then another to index that tsvector column). Which is why I am changing DB/Schema.pm to allow multiple SQL statements. See Schema/Pg.pm.

> >Index: Bugzilla/DB/Schema/Mysql.pm
> >@@ -143,14 +143,16 @@ sub _get_create_index_ddl {
> 
>   The changes here aren't necessary.

I disagree. I changed Schema.pm so I need to change the MySQL function as well (it excepts an array/list).

> >@@ -81,6 +63,8 @@ sub _initialize {
> >+        VECTOR   =>     'tsvector',
> 
>   Is there some reason to add that? You could just specify tsvector directly in
> the TYPE arguments--that works fine.

Thought that would be cleaner.

> >@@ -163,4 +147,41 @@ sub _get_alter_type_sql {
> >+    # Ensure this column always reflects the index fields
> >+    my $sql_trigger = "CREATE TRIGGER fti_$vector_column BEFORE UPDATE OR INSERT
> 
>   That shouldn't be AFTER UPDATE OR INSERT?

No, because that would be too late. Then you'd need to do another UPDATE table SQL statement. The vectors need to be calculated based upon the current thetext and have to be updated in the tsvector column. This is per tsearch docs.
With an AFTER UPDATE you'd need another SQL to update the tsvector column.

>   I'm also concerned about the upgrade path from this to PostgreSQL 8.3, as I
> said on IRC. Until 8.3 comes out of beta, though, we don't have to support it
> at all. Just promise me that this *can* somehow be upgraded and that we can
> detect the 8.3 changes and so forth.

I'll check later. Upgrading any Pg is a pain though (except with Mandriva).
Bugzilla 3.2 is now frozen. Only enhancements blocking 3.2 or specifically approved for 3.2 may be checked in to the 3.2 branch. If you would like to nominate your enhancement for Bugzilla 3.2, set "blocking3.2" tp "?", and either the target milestone will be changed back, or the blocking3.2 flag will be granted, if we will accept this enhancement for Bugzilla 3.2.
Target Milestone: Bugzilla 3.2 → Bugzilla 4.0
Assignee: bugzilla-mozilla → database
Status: ASSIGNED → NEW
Now that we require PostgreSQL 8.3, we can use the built-in fulltext system, if somebody wants to work on that.
We are going to branch for Bugzilla 4.4 next week and this bug is either too invasive to be accepted for 4.4 at this point or shows no recent activity. The target milestone is reset and will be set again *only* when a patch is attached and approved.

I ask the assignee to reassign the bug to the default assignee if you don't plan to work on this bug in the near future, to make it clearer which bugs should be fixed by someone else.
Target Milestone: Bugzilla 4.4 → ---
Summary: [PostgreSQL] Use tsearch2 for fulltext indexing when available → [PostgreSQL] Use tsearch2 for fulltext indexing
Summary: [PostgreSQL] Use tsearch2 for fulltext indexing → [PostgreSQL] Support builtin fulltext search
Summary: [PostgreSQL] Support builtin fulltext search → [PostgreSQL] Support builtin fulltext search capabilities
We now require Pg 9.0. Matt: is this something you would like to work on?
I had an experimental patch for FTS in the BRC codebase.  I had to back it out as it was causing searches to break.

The way postgres does full text searching is a bit different to MySQL.  It strips out non-english characters and punctuation when it saves the data to the tsvector columns.  This can be a problem for searching as people often like to search on error messages, which contain those characters in abundance.

When creating the search SQL, you need to take care on what terms you use in the search query you pass to postgres.  Mixing punctuation type characters in there can cause the search to fail with an SQL error.  Any code that generates the SQL to pass on to Postgres will need to strip out the offending characters.

Postgres also has a limit on how big a TSVECTOR column can be.  Often people will create a bug and copy-paste a wall of debug output or a crazy error message.  Postgres can decided that this is too much data for a TSVECTOR and will refuse to store it.  This means that the content can't ever be searched for.

It might be possible to address all of these problems, but it looks like it will be much more effort than I expected.  Currently we are planning to migrate BRC to bugzilla 5.0 and I don't have time to work on it at present.

I'm not sure postgre FTS is suitable for use in Bugzilla.  It seems like it's more designed to index the contents of a book.  Storing error messages with all sorts of odd symbols seems to be not what it was designed for.  I'm going to have to do a lot more research on it before attempting to implement it again.

Anyway, I'm not an expert on postgres FTS, so maybe someone else can chime in with an answer.
Well, if the benefit for Bugzilla is negligible, then it doesn't worth the effort. We can leave the bug open in case things become better in a future Pg release.
You need to log in before you can comment on or make changes to this bug.