Today WordPress announced they are using Sphinx to help developers find plugins faster on their site. Sphnix is an open source search engine for indexing primarily databases and is much faster than the standard full text search feature of most databases. I’ve put together a brief tutorial to help you get started with this application.

I have been using Sphnix for about 6 months now on a couple projects and I have been pleased with the results.

You need to first install the Sphinx engine either on your server or development workstation. The Sphinx site has decent documentation on getting started. You may find that many distributions of Linux already have it as a package, so it may be a simple as “apt-get install sphnix” or “yum install sphnix.”

There are two major components of the configuration the source (what Sphnix should crawl) and the index paths (where Sphinx should store its data). Everything else in the configuration can pretty much be left to the defaults or can be fine tuned later.

Here is an example of a source:

source src3
{
	type					= mysql
	sql_host				= localhost
	sql_user				= db_user
	sql_pass				= db_pass
	sql_db					= intranet
	sql_port				= 3306	# optional, default is 3306
	# mandatory, integer document ID field MUST be the first selected column
	sql_query				= \
SELECT `Paper`.`id` , `Paper`.`title` , `Paper`.`abstract`, `Paper`.`conference`, `Paper`.`keywords`, UNIX_TIMESTAMP( `Paper`.`modified` ) AS `modified`, \
CONCAT(`Author`.`firstname`, " ", `Author`.`lastname`) as `author_name` \
FROM `papers` AS `Paper` \
LEFT JOIN `authors` AS `Author` ON `Author`.`id` = `Paper`.`author_id`
	sql_ranged_throttle	= 0
	sql_query_info		= SELECT * FROM `papers` WHERE id=$id
}

It starts off with what you would expect to see in any web application configuration file, which is the database connection information. It is then followed by a query of data to be included in the search index. Any data item can be included or excluded in the actual search depending on the parameters you pass to the Sphinx engine.

This is then followed by the index, which is where the Indexer stores its data.

index paper_quick
{
	source			= src3
	path			= C:\xampp\sphinx\data\paper_quick
	docinfo			= extern
	mlock			= 0
	morphology		= none
	min_word_len		= 2
	charset_type		= sbcs
	html_strip				= 1
}

Here you can see we referenced the data source and provided a path for it to store its indexed data. We have also chosen to strip out the html (html_strip), so that text formatted in an online HTML editor is ignored. This way things like “href” and “em” aren’t searchable.

Once this is done you’ll need to run the indexer to build the index. This is invoked by the “bin/indexer” command. Run without parameters it will run all indexes, with a specific index specified it will only update that particular index. In a development environment I run the index manually when I need to for testing, but in a production environment I set it up as a scheduled task either in the System Tools in Windows or through Cron in Linux.

#!/bin/sh
/etc/init.d/searchd stop
/usr/local/bin/indexer --all >> /var/log/indexer.log
/etc/init.d/searchd start

Once this is done you can connect to the Sphinx on the port specified in the searchd section of the configuration. The default is 3312.

You can find a number of APIs to connect Sphnix to your web application from their website on the contributors page.

Here is a quick PHP example. Using the search index in the config sections earlier in this post.

<? 
require_once("sphinxapi.php");

$host = '192.168.0.1';
$port = 3312;
$mode = SPH_MATCH_ANY;
$index = 'paper_quick';

$ds = new SphinxClient();
$ds->SetServer($host, $port);
$ds->SetMatchMode($mode);

if(array_key_exists('search', $_REQUEST))
{
   $results = $ds->Query($_REQUEST['search'], $index);
   
   print "We found these matches: ";
   print_r($results['matches']);

   $result_ids = implode(',', array_keys($results['matches']));
}
else
{
?>

<? } ?>

What it will give you is a list of table rows that have matching results and their relevancies to the search term(s). You can then combine this with the actual data in your database using a standard query like this one.

SELECT `Paper`.`id` , `Paper`.`title` , CONCAT(`Author`.`firstname`, " ", `Author`.`lastname`) as `author_name` 
FROM `papers` AS `Paper` 
LEFT JOIN `authors` AS `Author` ON `Author`.`id` = `Paper`.`author_id`
WHERE `Paper`.`id` IN ('.$result_ids.')

It seems like a little extra work, which it is, but if it helps your users find things faster it is well worth the effort.