Add Nutch Search functionality to MDC

RESOLVED FIXED

Status

developer.mozilla.org
General
RESOLVED FIXED
12 years ago
6 years ago

People

(Reporter: sancus, Assigned: sancus)

Tracking

Details

Attachments

(4 attachments, 5 obsolete attachments)

(Assignee)

Description

12 years ago
The nutch search engine needs to be added to MDC as a Special Page as a replacement for the built-in mediawiki search.
(Assignee)

Comment 1

12 years ago
Created attachment 231817 [details] [diff] [review]
Nutch Special Page
Attachment #231817 - Flags: review?(shaver)
(Assignee)

Comment 2

12 years ago
Created attachment 231818 [details] [diff] [review]
Helper class for the Special Page
Attachment #231818 - Flags: review?(shaver)
(Assignee)

Comment 3

12 years ago
Created attachment 231820 [details] [diff] [review]
Load test for the opensearch servlet
Attachment #231820 - Flags: review?(shaver)
(Assignee)

Comment 4

12 years ago
Created attachment 231821 [details] [diff] [review]
Unit Tests
Attachment #231821 - Flags: review?(shaver)
(Assignee)

Comment 5

12 years ago
Created attachment 231826 [details] [diff] [review]
Nutch Special Page

Fixed white space.
Attachment #231817 - Attachment is obsolete: true
Attachment #231826 - Flags: review?(shaver)
Attachment #231817 - Flags: review?(shaver)
Comment on attachment 231818 [details] [diff] [review]
Helper class for the Special Page

Random question: should this be called mzOpenSearch instead of mzNutch, because
it doesn't actually have any dependencies on the service being Nutch?  (We
should use mz rather than mw in prefixes for our classes, to avoid collisions
and confusion with future MW-core classes.) 

><?php
>/*
>This is a helper class for the MDC Nutch Special Page
>Configuration is at the top of the class, you should set
>$sitename to the site that is being searched, $opensearch_url
>to the url of the nutch os servlet and $xsl_file to the
>.xsl file being used to transform the xml
>*/
>Class mwNutch {
>	var $sitename="developer.mozilla.org";
>	var $opensearch_url="localhost:443/opensearch";

The opensearch URL should be global config variables that we set in LocalConfig, as we do for the path to the doxygen XML dir in that extension.  If $mzOpenSearchURL isn't set in the config, you should emit an error when the class is constructed or queried, to help diagnose.

Just use $wgServerName for the sitename, instead of a custom variable.

>	var $xsl_file="extensions/SpecialNutch.xsl";

I don't see a reason to make this configurable; it's basically a source file,
and if we wanted to configure it we'd probably move the XSL work to the skin
instead.  (Fine to leave it as a var here, though, just update the comment.)

>	function xtDataWithFile($xmlData) {
>  	// wrapper for xslt_process to make things a bit easier
>  	$args = array ( '/_xml' => $xmlData );
>  	$xp = xslt_create();
>  	$out = xslt_process($xp, 'arg:/_xml', $this->xsl_file, NULL, $args);
>  	xslt_free($xp);
>  	return $out;
>	}
>	
>	function queryNutch($query, $hitsperpage, $start, $language) {
>		$ch = curl_init();
>    $langstring = $this->sitename . "/" . $language . "/docs+";

Inconsistent whitespace/indentation (depth, tabs-vs-spaces) throughout.  Please
use soft tabs or space instead of tabs, whatever your editor calls it!

>    $query = preg_replace('/\s+/', '+', $query);

Is this all we need to escape?  I suspect we want to use an existing make-URL-safe function from PHP here, so that we work for & and other
possibly-problematic characters.

(You'll of course add a test for strange characters and non-ASCII stuff as part
of that process! :) )

>    //setup search string 		
>    $search = "?query=" . "url:" . $langstring . $query . "&hitsPerPage=" . $hitsperpage . "&hitsPerSite=0" . "&start=" . $start;

Worth wfDebug()ing the computed URL here, so it's easier to trace what's being
done via debug logs?

>    curl_setopt($ch, CURLOPT_URL, $this->opensearch_url . $search);
>    curl_setopt($ch, CURLOPT_HEADER, false);
>    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
>    
>    // grab XML
>    $xml = curl_exec($ch);

What happens to errors here?  It would be good to get a meaningful diagnostic
somewhere.

>    curl_close($ch);
>    return $xml;
>	}
>	
>}
>?>
Attachment #231818 - Flags: review?(shaver) → review-
Comment on attachment 231821 [details] [diff] [review]
Unit Tests

><?php
>/*Unit Tests for the MDC Nutch Special Page*/

Nit: not really for the special page, but for the Nutch/OpenSearch class.

>require_once('SimpleTest/unit_tester.php');
>require_once('SimpleTest/reporter.php');
>require_once('/var/www/html/en/docs/extensions/SpecialNutch_Class.php');

This needs to be not quite as hard coded, though I suspect you'll need to make
some assumptions about where that's found.  The MW tests all seem to use PHPUnit,
which looks like it wouldn't be much of a change later, but you could still just
say that this page needs to find the extensions dir in ../extensions, and that
SimpleTest needs to be in the INC_PATH for now.

>class TestOfNutch extends UnitTestCase {
>
>  function TestOfNutch() {
>    $this->UnitTestCase('Testing Nutch Opensearch Query');
>  }
>  function testQueryingNutch() {
>    $nutch = new mwNutch;
>    $xml = $nutch->queryNutch("mdc", "0", "10", "en");
>    $this->assertWantedPattern("/rss xmlns:nutch/", $xml);
>  }
>  function testTransformNutchXML() {
>    $nutch = new mwNutch;
>    $xml = $nutch->queryNutch("xul tutorial", 0, 10, "en");
>    $output = $nutch->xtDataWithFile($xml, "os.xsl");
>    $this->assertWantedPattern("/<p>Sorry, no results were found\.<\/p>/", $output);
>  }

Are you asserting here that "xul tutorial" _doesn't_ find any hits, or am I
just reading it incorrectly?

>  function testSearchEN() {
>    $pass = HelperFunctions::TestLanguage("en", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchDE() {
>    $pass = HelperFunctions::TestLanguage("de", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchCA() {
>    $pass = HelperFunctions::TestLanguage("ca", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchCN() {
>    $pass = HelperFunctions::TestLanguage("cn", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchFR() {
>    $pass = HelperFunctions::TestLanguage("fr", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchIT() {
>    $pass = HelperFunctions::TestLanguage("it", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchJA() {
>    $pass = HelperFunctions::TestLanguage("ja", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchNL() {
>    $pass = HelperFunctions::TestLanguage("nl", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchPL() {
>    $pass = HelperFunctions::TestLanguage("pl", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchPT() {
>    $pass = HelperFunctions::TestLanguage("pt", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchES() {
>    $pass = HelperFunctions::TestLanguage("es", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  function testSearchKO() {
>    $pass = HelperFunctions::TestLanguage("ko", 100, "mdc");
>    $this->assertEqual($pass[0], $pass[1]);
>  }
>  
>}

>class HelperFunctions {

>  function TestLanguage($lang = null, $expected = 0, $search = null) {
>    /*
>    Checks to see if searching for a given language returns
>    links only to that language wiki. Will modify the number
>    of expected links below the passed value if less results
>    are returned, in case some searches simply don't return
>    that many links.
>    Always fails if fewer than 5 links are returned.
>    */

(Multi-line comments want to have * at the beginning of each line, to make
life easier on us dinosaurs who don't always have syntax highlighting. :) )

It seems like you want to have the TestLanguage method be in your TestOfNutch
(really TestOfOpenSearch?) class so that it can call $this->assertMumble.  That
would give you two benefits:

- You wouldn't have to duplicate the somewhat-cryptic
  $this->assertEqual($pass[0], $pass[1]);
code in each test that calls the same helper function!

- You could assert that each URL has the prefix, rather than doing it indirectly
via the counted results.  That would make failures more descriptive and useful,
and also make the code cleaner.

>    $nutch = new mwNutch;
>    $xml = $nutch->queryNutch($search, $expected, 0, $lang);

Given that neither $search nor $expected vary, seems like you could avoid
repetition by just removing them as parameters.  "testLanguage" could just take
the language code, then.

>    $stuff = explode("\n", $xml);
>    $k=0;

Imagine how thrilled I am about those variable names there. :)

>    //get the number of results returned
>    foreach ($stuff as $line) {
>      if (strstr($line, "<opensearch:totalResults>")) {
>        sscanf($line, "<opensearch:totalResults>%d</opensearch:totalResults>", $count_results);
>        break;
>      }
>    }
>    
>    //if number of overall results is less than expected, modify expected
>    if ($count_results < $expected) $expected = $count_results;
>    //set an absolute minimum of 5 links, so the test will fail on any less
>    if ($expected < 5) $expected = 5; 

Especially given that test, what's the point of having the caller give us the
expected number at all?  Or maybe it should be called "maxResults"?

>    //check that all results are for the given language
>    $langstring = "developer.mozilla.org/" . $lang . "/docs";

Hard-coding sitenames makes me a bit sad, esp since it means we have to
crawl/etc. the live site to set up a test environment.  Do we care in fact that the site name matches, for the language test?  Seems like we could use strip
out the hostname blindly and check for the lang prefix, or (better) construct a
regex that would skip it (untested: "[^/]+/$lang/docs", +/- the escaping of / if
needed) to pass to assertMatches or whatever.

>    foreach ($stuff as $line) {
>      if (strstr($line, "<link>")) {
>        if (strstr($line, $langstring)) $k++;
>      }

(This code will, I hope, go away, but it seems like you would combine those
tests with an &&, or use a combined regexp.)

>    }
>    return array($expected, $k);
>  }
>}
Attachment #231821 - Flags: review?(shaver) → review-
Comment on attachment 231826 [details] [diff] [review]
Nutch Special Page

Nit: 4-space indent is morally superior.

><?php
>$wgExtensionFunctions[] = 'wfSpecialNutch';

mzSpecialNutch, I think.

>$wgExtensionCredits['specialpage'][] = array(
>  'name' => 'Nutch',
>  'author' => 'Andrei Hajdukewycz',
>  'description' => '[[Special:Nutch|a special page]] that allows access to the Nutch search engine',
>  'url' => 'http://developer.mozilla.org'
>);

Recurring: "Nutch" or "OpenSearch"?

>  $wgMessageCache->addMessages(
>    array(
>      'nutch' => 'Nutch Search',
>    )
>  );

What needs to be done to localize this?  Or is it not shown to site visitors in
the normal course of their searching?

(File a follow-up bug on l10n of this stuff, we can figure that out in parallel
to the deployed testing I think.)

>  require_once "$IP/includes/SpecialPage.php";
>  require_once "SpecialNutch_Class.php";
>  
>  class SpecialNutch extends SpecialPage {
>    function SpecialNutch() {
>      SpecialPage::SpecialPage( 'Nutch' );
>    }
>    
>    function execute( $par ) {
>      global $wgRequest, $wgOut;
>      $nutch = new mwNutch;
>      $query = $_GET['query'];
>      $lang = $_GET['language'];
>      $hitsperpage = $_GET['hitsPerPage'];
>      $start = $_GET['start'];
>      $me = $_SERVER['PHP_SELF'];
>      
>      
>      
>      
>      $wgOut->setPagetitle("Nutch Search: $query");

Another localization gotcha.

>    
>
>      if (isset($_GET['query']) == FALSE){

Vertical whitespace stocks are at critically-low levels.  Please conserve!

>        $me = $_SERVER['PHP_SELF'];
>        $wgOut->addHTML("<form name=\"form1\" method=\"get\" action=\"$me\">
>                      Nutch Mediawiki Search<br>
>                      <input type=\"text\" name=\"query\"><br>
>                      <select name=\"language\">
>                      <option value=\"en\" selected>English
>                      <option value=\"ca\" >Catal&#224;
>                      <option value=\"cn\" >Chinese
>                      <option value=\"de\" >Deutsch
>                      <option value=\"fr\" >Fran&#231;ais
>                      <option value=\"it\" >Italiano
>                      <option value=\"ja\" >Japanese
>                      <option value=\"nl\" >Nederlands
>                      <option value=\"pl\" >Polski
>                      <option value=\"pt\" >Portugu&#234;s
>                      <option value=\"es\" >Espa&#241;ol
>                      <option value=\"ko\" >Korean
>                      </select>
>                      <input type=\"hidden\" name=\"start\" value=\"0\">
>                      <input type=\"hidden\" name=\"hitsPerPage\" value=\"10\">
>                      <input type=\"submit\" value=\"Search\">  
>                      </form>");
>      }

A little gross, but the next choice is to have the skin render it, so we'll
want a follow-up bug on that to make sure we at least catch it during the
upcoming skin update.

We will need to pull in localized labels for the submit button and such, though.

Is there really no list of locales and "native" names anywhere that could make
this a little more dynamic?

>      else {
>        
>        //check input for validity, escape query
>        $query = mysql_real_escape_string($query);
>        if (!preg_match("/^[a-z]{2}$/", $lang)) die("Bad language input");
>        if (!preg_match("/^[0-9]{2}$/", $hitsperpage)) die("Bad hitsperpage input");
>        if (!preg_match("/^[0-9]{1,15}$/", $start)) die("Bad start input");
>        
>        $xml = $nutch->queryNutch($query, $hitsperpage, $start, $lang);
>        
>        //do the transform, xsl file is set in the $nutch object
>        $xmlstuff = $nutch->xtDataWithFile($xml);
>        
>        //output final search results
>        $wgOut->addHTML("$xmlstuff");
>        
>        //adding next page and previous page links, 10 results per page
>        if ($start>=10) {
>          $previous = $start - 10;
>          $wgOut->addHTML("<a href=\"$me?query=$query&language=$lang&start=$previous&hitsPerPage=10\">Previous Page</a>&nbsp;");
>        }
>        $next = $start+10;
>        $wgOut->addHTML("<a href=\"$me?query=$query&language=$lang&start=$next&hitsPerPage=10\">Next Page</a>");

Mo' labels, mo' l10n.

>     }    
>      
>    }
>  }
>  SpecialPage::addPage( new SpecialNutch );
>}


r=shaver if the Nutch-vs-OpenSearch debate is settled, and the nits about
formatting and naming are fixed.
Attachment #231826 - Flags: review?(shaver) → review+
(Assignee)

Comment 9

12 years ago
Created attachment 234423 [details]
Helper Class v2

"Random question: should this be called mzOpenSearch instead of mzNutch"

Addressed this with shaver, the extension may not be dependent on Nutch right now, but almost certainly will be in the future due to Nutch's output extending the OS output with its own tags.

"The opensearch URL should be global config variables that we set"

fixed, and added existence checking

"I suspect we want to use an existing make-URL-safe function from PHP here"

using urlencode now.

Also added some error checking and wfDebug stuff.
Attachment #231818 - Attachment is obsolete: true
Attachment #234423 - Flags: review?(shaver)
(Assignee)

Comment 10

12 years ago
Created attachment 234433 [details] [diff] [review]
Nutch Special Page v2

Localization:

Anything in wgMessageCache shows up in Special:Allmessages, and you can edit it without touching any code. This is the method suggested in the mediawiki documentation for localization of these things, and I think it would work for us since we have a seperate wiki for each language, someone could just edit in the localizations of each of the messages.

I've put every message I output in the MessageCache.

Regarding the list of languages, there doesn't seem to be such a list, certainly not in the MessageCache which is where it should be if there was one. There is kindof one in the Mediawiki skin we're using, but those are urls and it's pretty messy to get them out.

I can add each language to the MessageCache as 'langEnglish' etc, and do it that way, possibly.
Attachment #231826 - Attachment is obsolete: true
Attachment #234433 - Flags: review?(shaver)
(Assignee)

Comment 11

12 years ago
Created attachment 234832 [details] [diff] [review]
Nutch Class Unit Test v2

Changed language tests to assert against each returned link.
Attachment #231821 - Attachment is obsolete: true
Attachment #234832 - Flags: review?(shaver)
(Assignee)

Updated

12 years ago
Attachment #234832 - Flags: review?(shaver)
(Assignee)

Comment 12

12 years ago
Created attachment 234835 [details] [diff] [review]
Nutch Class Unit Test v3

stopped grabbing number of total results, since you can just count how many links were checked
Attachment #234832 - Attachment is obsolete: true
Attachment #234835 - Flags: review?(shaver)
Comment on attachment 231820 [details] [diff] [review]
Load test for the opensearch servlet

Yer whitespace is inconsistent.

Yer "then" is on the same line as yer "if".

And you want to use sleep() or usleep() instead of busy-looping on the poor, defenseless machine you're running on.
Attachment #231820 - Flags: review?(shaver) → review-
Comment on attachment 234423 [details]
Helper Class v2

>        if (!isset($mzNutchURL)) wfDebug("mzNutch Error: $mzNutchURL not set in your global settings file!");

Two lines!

But also, should this not be a more fatal sort of error, perhaps presented to the user like a database error is?

>        // grab XML via curl, and report errors to wfDebug if any
>        $xml = curl_exec($ch);
>        if (curl_errno($ch)) {
>            wfDebug(curl_error($ch));
>        } else {
>            curl_close($ch);
>        }

Similarly here, it'll be a _lot_ easier to diagnose failures to contact the nutch server if there's something spat out to the user.

Neither of these are show-stoppers, though, so r=shaver if you file a follow-up bug on improved diagnostics.
Attachment #234423 - Attachment is patch: false
Attachment #234423 - Flags: review?(shaver) → review+
Comment on attachment 234835 [details] [diff] [review]
Nutch Class Unit Test v3

r=shaver.
Attachment #234835 - Flags: review?(shaver) → review+
Comment on attachment 234433 [details] [diff] [review]
Nutch Special Page v2

>                      <select name=\"language\">
>                      <option value=\"en\" selected>English
>                      <option value=\"ca\" >Catal&#224;
>                      <option value=\"cn\" >Chinese
>                      <option value=\"de\" >Deutsch
>                      <option value=\"fr\" >Fran&#231;ais
>                      <option value=\"it\" >Italiano
>                      <option value=\"ja\" >Japanese
>                      <option value=\"nl\" >Nederlands
>                      <option value=\"pl\" >Polski
>                      <option value=\"pt\" >Portugu&#234;s
>                      <option value=\"es\" >Espa&#241;ol
>                      <option value=\"ko\" >Korean
>                      </select>

I am pleased to report that we have two new languages, Russian and Czech! :)

>        //output final search results
>        $wgOut->addHTML("$xmlstuff");

Don't need the quotes there, just sayin'.

r=shaver
Attachment #234433 - Flags: review?(shaver) → review+

Comment 17

12 years ago
Anything else needs to be done in this bug or can it be closed?
(Assignee)

Comment 18

12 years ago
Indeed it can be closed, my bad.
Status: NEW → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → FIXED
Component: Deki Infrastructure → Other
Product: Mozilla Developer Network → Mozilla Developer Network
You need to log in before you can comment on or make changes to this bug.