Closed Bug 532156 Opened 15 years ago Closed 15 years ago

Search v3 - meta bug

Categories

(support.mozilla.org :: Search, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jsocol, Unassigned)

References

Details

(Whiteboard: sumo_only)

This is just an tracker for Louis-Philippe for upstreaming the latest search work.

This is essentially self-contained in webroot/search.php, webroot/templates/search.tpl, webroot/lib/search/ and the scripts/sphinx/ directory.

Paul, if I'm leaving anything out, please point it out here.

NB: This is not a "proper" SUMO bug in that all the work is already done.
Looks good on first pass.
Just a note to LPH that you should probably wait with upstreaming this until, say, a week because we haven't yet pushed this to prod and we may discover some small bugs. 

Otherwise we'll have to mark any subsequent fix as tiki_bug, which should be avoided if possible.

Thanks to all!
Just looking through the code. Some refactoring would be helpful before upstreaming. I know I'm annoying with this, but this is once again a case of sumo-specific code. Sure we want sphinx search in tiki, but it can't be bound to the features used by sumo so directly.

First off, the query in indexer is massive. Counting 11 joins. Using the libraries more would make it more resistant to changes in tiki and allow the community to add other feature-support to it. Sure it might not be as efficient, but I doubt that query is anyway.

Why does it force the master? Doesn't indexing from a slave make sense?

Speaking of efficiency, xmlpipe2 can be used as a pipe. If you write the XML directly to the output stream, you can use "xmlpipe_command = php sphinx/indexer.php" and index the content as it's generated. From that moment, getting PHP to get data when it needs it in order to write the content to the output as fast as possible (rather than wait for a large query) is a lot more efficient.

The indexer also contains a chunk of code to filter the wiki syntax to remove undesired elements. I would like to see that move to a common library instead. Once again, this way it could be maintained and evolve. For one thing, maketoc is a special case, but since version 3.0, plugin execution was modified. It could be possible to simply disable the redirect plugin for the execution. Having that in a library would help think about this code differently.

In terms of sphinx usage, I would rather have the schema definition be generated within the indexer. It makes it easier to add fields conditionally and avoid the configuration spread.

All of the values in search.conf.php should really be stored as preferences within tiki.

Inclusion of sphinxapi.php should be made conditional, in case the extension is installed.

I don't really understand why search.php contains anything about weights. It could easily all be moved within the library and be transparent.

As a general rule, because we no longer support PHP4, private methods should be marked as such.
Whiteboard: tiki_feature → tiki_feature tiki_discuss
(In reply to comment #3)
> First off, the query in indexer is massive. Counting 11 joins. ... Sure it might not be as
> efficient, but I doubt that query is anyway.

If you think that's bad, check out AMO's: http://bit.ly/5sWLkw. For what it's worth, this runs in < 2 minutes on our production machine, and in around 5 in my local VM, with our > 500k database rows. (With a warm start, anyway. It can take 12-15 minutes for me with a cold disk cache.)

> Why does it force the master? Doesn't indexing from a slave make sense?

I'm not sure I understand here. Just point localsettings.py to a slave. I can't find anything in the PHP accessing TikiDB::forceMaster(). Where are you seeing this?

> Speaking of efficiency, xmlpipe2 can be used as a pipe. If you write the XML
> directly to the output stream, you can use "xmlpipe_command = php
> sphinx/indexer.php" and index the content as it's generated. From that moment,
> getting PHP to get data when it needs it in order to write the content to the
> output as fast as possible (rather than wait for a large query) is a lot more
> efficient.

You could also use sql_range_query and sql_range_step. I don't know how multi-threaded the indexer is or what kind of blocking it does on the query, but it seems pretty intelligent. Like I said, doing this takes less than 2 minutes. Our old version, using PHP to generate XML took 15-30.

> The indexer also contains a chunk of code to filter the wiki syntax to remove
> undesired elements. I would like to see that move to a common library instead.
> Once again, this way it could be maintained and evolve. For one thing, maketoc
> is a special case, but since version 3.0, plugin execution was modified. It
> could be possible to simply disable the redirect plugin for the execution.
> Having that in a library would help think about this code differently.

The indexer ignores articles that start with {REDIRECT(), that's all. I know it's not a perfect solution but it fits our needs. It doesn't do any other filtering on Wiki markup--it can't. We can't make every wiki plugin a stop word. It can, however, ignore HTML, so a potential improvement is storing the wiki pages fully rendered and then indexing those.

SphinxLib::clean_markup() is the filtering function.

> In terms of sphinx usage, I would rather have the schema definition be
> generated within the indexer. It makes it easier to add fields conditionally
> and avoid the configuration spread.

I'm sure you could modify sphinx.conf to query the database for a schema and then generate the appropriate Sphinx configuration, but most of the sub queries and sql_attr_multis (last_updated for a forum thread, tags for wiki pages for example) wouldn't be available unless they were special cases, anyway.

> All of the values in search.conf.php should really be stored as preferences
> within tiki.

Many of them should, but a good number are just constants, not settings, and should really be moved to some kind of central location. (See bug 532004 for similar refactoring.) One thing I'd like to change, while doing this, is moving the SEARCH_WHERE_* constants to a bit mask.

> Inclusion of sphinxapi.php should be made conditional, in case the extension is
> installed.

Are there any meaningful changes between the two? Looking briefly it looks like PECL/Sphinx's SphinxClient methods start with lowercase letters, and sphinxapi.php's start with upper case--PHP is OK with that (though it shouldn't be) but is everything else the same?

> I don't really understand why search.php contains anything about weights. It
> could easily all be moved within the library and be transparent.

Agreed. There's a bunch of stuff like this that could be cleaned up.

> As a general rule, because we no longer support PHP4, private methods should be
> marked as such.

Agreed. All class/object methods and properties should have defined visibility.
> If you think that's bad, check out AMO's: http://bit.ly/5sWLkw. For what it's
> worth, this runs in < 2 minutes on our production machine, and in around 5 in
> my local VM, with our > 500k database rows. (With a warm start, anyway. It can
> take 12-15 minutes for me with a cold disk cache.)

Still, this kind of query causes trouble if upstreamed as it does not take all features into account. Keeping this kind of structure will just make the query grow even bigger.

The indexer is just not upstreameable like this. I can understand the goal of SUMO to have an optimized version due to the volume, but I can't upstream this to the community unless there is a version that can work for anyone, even if it has to be slower.

> 
> I'm not sure I understand here. Just point localsettings.py to a slave. I can't
> find anything in the PHP accessing TikiDB::forceMaster(). Where are you seeing
> this?

$tikilib->forceMaster();
scripts/sphinx/indexer.php:41

> > In terms of sphinx usage, I would rather have the schema definition be
> > generated within the indexer. It makes it easier to add fields conditionally
> > and avoid the configuration spread.
> 
> I'm sure you could modify sphinx.conf to query the database for a schema and
> then generate the appropriate Sphinx configuration, but most of the sub queries
> and sql_attr_multis (last_updated for a forum thread, tags for wiki pages for
> example) wouldn't be available unless they were special cases, anyway.

I just mean moving the schema from sphinx.conf to formally declaring it as part of the XML, so it can be dynamically generated based on the enabled feature.

> > Inclusion of sphinxapi.php should be made conditional, in case the extension is
> > installed.
> 
> Are there any meaningful changes between the two? Looking briefly it looks like
> PECL/Sphinx's SphinxClient methods start with lowercase letters, and
> sphinxapi.php's start with upper case--PHP is OK with that (though it shouldn't
> be) but is everything else the same?

I assumed you did not write that one yourselves and that it was an exact clone. If the class is not identical, it should have a different name, otherwise it would crash on servers with the sphinx extension installed.
Talking amongst ourselves, I think ultimately we may be at an impasse with this feature. We need it to index the features we use, quickly and minimally. A major sticking point in Search v3 was switching from xmlpipe2 to direct SQL queries, which is not really a feasible solution upstream, from your comments. We have no desire to move back: building XML through PHP was just not performant, and it's not an efficient way to use Sphinx.

We've used /search.php as our search script. If, upstream, you want to use any of this code, and it's in /tiki-search.php, then there should be no conflict between our implementations.

I see no reason why this implementation of a Sphinx-based search needs to be upstreamed if it (a) doesn't meet the general needs of Tiki and (b) can co-exist with the Tiki search.

As long as those are true, I'm comfortable making this a sumo_only feature. 

Obviously the source is all there, if it's helpful to anyone building a generic Tiki Sphinx-backed search engine. We're around to share what insight we gleaned from our work, &c. But I think we have different needs here and trying to please everyone isn't going to work.
Thanks for the insightful comment, James. I think this makes a lot of sense. There is little point in trying to shoehorn our solution into upstream Tiki if it only makes sense for SUMO. If our implementation can co-exist with the Tiki search, we can definitely mark this as sumo_only.
sumo_only it is, but it's a sad conclusion.
-> sumo_only.

We just have different needs than the rest of the Tiki community here. Like I said, we're around and glad to help where we can if someone is interested in building a Sphinx search feature upstream.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Whiteboard: tiki_feature tiki_discuss → sumo_only
Blocks: 544528
You need to log in before you can comment on or make changes to this bug.