Using Lucene.NET with SharePoint 2007, Part 1

In a customer project we ran into a requirement we thought would be easily solvable with SharePoint 2007: the ability to search items in a list with wildcards.

Well, you can. But there’s a catch.

After some investigation we found out that you actually can do wildcard searches by tweaking the standard search results page in SharePoint, for instance by introducing some web parts that can be found on CodePlex. But there is a limitation: SharePoint only seems to support prefix searches, where the wildcard comes at the end of the term.

So I can search for “ShareP*”, but I can’t search for “Sha*Point”. This seems to be because SharePoint uses MSSearch.exe under the hood (please correct me if I’m wrong), which seems to be limited in its ability to perform wildcard searches. So we had to find another solution.

We went for Lucene.NET, which is the .NET port of the open-source Apache Lucene project.
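
To make the difference concrete, here is a minimal sketch of the kind of query that Lucene.NET accepts and the standard SharePoint search did not. It assumes the same Lucene.NET 2.x API used in the rest of this post; the in-memory index, the class name and the "Title" field are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical demo class, not part of the actual solution
public static class WildcardSketch
{
    public static void Main()
    {
        // In-memory index, for illustration only
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);

        // Index a single document with one searchable field
        Document doc = new Document();
        doc.Add(new Field("Title", "SharePoint", Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // StandardAnalyzer lowercases terms at index time, and a
        // WildcardQuery is not analyzed, so the pattern is lowercase too
        Query query = new WildcardQuery(new Term("Title", "sha*point"));

        IndexSearcher searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        Console.WriteLine("Matches: " + hits.Length());
        searcher.Close();
    }
}
```

The wildcard sits in the middle of the term here, which is exactly the case we could not get working with the standard search.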

Lucene.NET is extremely simple to incorporate in your project (it’s just one DLL to add) and very simple to use. We built two components: a crawler that populates the Lucene index, and a web part that consumes the index and performs searches on it. This post is about the crawler.

The crawler

In order to populate the index we needed to run a crawler/indexer at certain intervals. For this we created a SharePoint timer job. The timer job goes out to preconfigured locations, works its way through the items, and adds the fields we need to search on to the index. Per list item we also added some non-searchable fields so we can find the full list item back at a later stage (site id, web id, list id and unique id of the list item). Have a look at a part of our code. I added comments to clarify things.

public override void Execute(Guid contentDbId)
{
    // Get a reference to the content database this job instance runs against
    SPWebApplication webApplication = this.Parent as SPWebApplication;
    SPContentDatabase contentDb = webApplication.ContentDatabases[contentDbId];

    // Retrieve some properties we added to the property bag of the
    // webApplication that define which list in which site to index.
    // We set these properties in a receiver class that gets executed
    // the moment we activate the feature. The receiver reads the values
    // from Property tags in the feature.xml file, for example:
    // <Property Key="IDXSearchIndexPath" Value="c:\\temp\\index"/>
    // The IDXSearchSite property contains the name of the site including
    // the path: e.g. sites/test
    string searchSite = webApplication.Properties["IDXSearchSite"].ToString();
    // The IDXSearchList property contains the name of the list we want
    // to index.
    string searchList = webApplication.Properties["IDXSearchList"].ToString();
    // The IDXSearchIndexPath property contains the full path where we
    // want to store the search index. e.g. c:\\searchindex
    string indexPath = webApplication.Properties["IDXSearchIndexPath"].ToString();
    // We stored the fields as a property in the feature.xml file
    // as a comma separated list, like "field1,field2,field3"
    string[] fields = webApplication.Properties["IDXSearchFields"].ToString().Split(new char[] { ',' });
    // In this situation we assume that the list is located in
    // the root web
    SPList list = contentDb.Sites[searchSite].RootWeb.Lists[searchList];

    // Set up the Lucene objects we need to index the list
    Analyzer analyzer = new StandardAnalyzer();
    // We store the index on a local disk. If you want to just test it
    // you could also use a RAMDirectory object, which lives in memory
    Lucene.Net.Store.Directory directory = FSDirectory.GetDirectory(indexPath, true);
    // Delete the existing index so every run rebuilds it from scratch.
    // Whether you want this depends on your requirements and needs.
    foreach (string f in directory.List())
    {
        directory.DeleteFile(f);
    }
    // Now we create an IndexWriter object that allows us to
    // populate the index with our data
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.SetMaxFieldLength(25000);
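    string listid = list.ID.ToString();
    // Keep a reference to the SPSite we opened through contentDb.Sites
    // so we can dispose it when we are done. The root web belongs to
    // that SPSite, so we do not dispose list.ParentWeb ourselves.
    SPSite site = list.ParentWeb.Site;
    string siteid = site.ID.ToString();
    try
    {
        // Loop through all the items
        foreach (SPListItem item in list.Items)
        {
            // Create a new Lucene document
            Document doc = new Document();
            // Add some non-indexed fields to the document that
            // we can use later on to find the correct list item
            doc.Add(new Field("idx_uniqueid", item.UniqueId.ToString(), Field.Store.YES, Field.Index.NO));
            doc.Add(new Field("idx_listid", listid, Field.Store.YES, Field.Index.NO));
            doc.Add(new Field("idx_siteid", siteid, Field.Store.YES, Field.Index.NO));
            // Brute force and ugly way to index all text
            foreach (string field in fields)
            {
                try
                {
                    // item[field] is null when the item has no value
                    // for this field
                    object value = item[field];
                    if (value != null)
                    {
                        doc.Add(new Field(field, value.ToString(), Field.Store.YES, Field.Index.TOKENIZED));
                    }
                }
                catch (Exception)
                {
                    // Skip fields that cannot be read, for example a
                    // configured field name that does not exist on this list
                }
            }
            iwriter.AddDocument(doc);
        }
    }
    finally
    {
        // Close the writer and the directory even when indexing fails,
        // so the index is not left locked
        iwriter.Close();
        directory.Close();
        site.Dispose();
    }
}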

The Execute method above comes from the timer job class we created, and runs every time the timer fires. It’s up to you to decide how often this is needed. You can also decide to update the index on an as-needed basis, by adding an event receiver to a list that inserts new items into the index as they are created.
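
As a rough sketch of that event receiver idea (not code from our actual solution), something like the following could append a newly added item to an existing index. The class name, the hard-coded index path and the field list are placeholders; in practice you would read them from the same property bag as the timer job. It assumes the index has already been created by the crawler.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Microsoft.SharePoint;

// Hypothetical receiver: name, index path and field list are placeholders
public class IndexingItemReceiver : SPItemEventReceiver
{
    private const string IndexPath = @"c:\searchindex";
    private static readonly string[] SearchFields = { "Title" };

    public override void ItemAdded(SPItemEventProperties properties)
    {
        SPListItem item = properties.ListItem;
        SPList list = item.ParentList;

        // Same stored, non-indexed lookup fields as in the crawler
        Document doc = new Document();
        doc.Add(new Field("idx_uniqueid", item.UniqueId.ToString(), Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("idx_listid", list.ID.ToString(), Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("idx_siteid", list.ParentWeb.Site.ID.ToString(), Field.Store.YES, Field.Index.NO));

        foreach (string field in SearchFields)
        {
            // Skip fields that have no value on this item
            object value = item[field];
            if (value != null)
            {
                doc.Add(new Field(field, value.ToString(), Field.Store.YES, Field.Index.TOKENIZED));
            }
        }

        // create = false opens the existing index and appends to it
        IndexWriter writer = new IndexWriter(IndexPath, new StandardAnalyzer(), false);
        try
        {
            writer.AddDocument(doc);
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Two things to keep in mind with this approach. Lucene allows only one writer on an index at a time, so the receiver and the timer job should not write to the same index simultaneously. And handling updated or deleted items would require finding the old document by its unique id, which means that field would have to be indexed (for example with Field.Index.UN_TOKENIZED) instead of only stored.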

This code example does not index attachments. You can index attachments by making use of IFilter objects, and we do so in our actual solution. Addressing that goes beyond the scope of this post, but I might go into it in a follow-up post.

Part 2 of this post will be about consuming the index in a web part and actually performing wildcard searches.

4 Responses to Using Lucene.NET with SharePoint 2007, Part 1

  1. Peter Ang says:

    Where is the link to Part 2 of the post ? Thanks

    • It will come :-) Been a bit busy and abroad the last couple of days. This week I will post the next part on how to write the webpart to consume the index.

  2. ariyo132 says:

    Very nice post! This is what I’ve been looking for my project. BTW, beside indexing attachments as you mentioned, is there anything else that we should take care of to implement a ‘good’ search in a Sharepoint site?

    • It’s important to realize that if you implement the crawler as a timer job, the job will run on the ‘application server’ (that is, if you have a farm with more than one server). The web parts or whatever you use to consume the index will probably run on one of the front-end servers, which means that they need access to the index. You can solve that with a network share if you want. It’s usually only the crawler that needs write access to the index.
