Using Lucene.NET with SharePoint 2007, Part 1

In a customer project we ran into a requirement we thought would be easily solvable with SharePoint 2007: the ability to search items in a list with wildcards.

Well, you can. But there’s a catch.

After some investigation we found out that you actually can do wildcard searches by tweaking the standard search results page in SharePoint, for instance by introducing some web parts that can be found on CodePlex. But there is a limitation: it seems that SharePoint only supports prefix matching, that is, a wildcard at the end of the search term.

So I can search for “ShareP*”, but I can’t search for “Sha*Point”. This seems to be because SharePoint uses the MSSearch.exe application under the hood (please correct me if I’m wrong), which seems to be limited in its ability to perform wildcard searches. So we had to find another solution.
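To make the difference between those two patterns concrete, here is a minimal sketch in plain C#. The WildcardDemo class and its Matches helper are made up for this illustration; they only show what each pattern is supposed to match and say nothing about how MSSearch or Lucene match terms internally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Hypothetical helper: translates a wildcard pattern into an anchored
    // regular expression, where * stands for any run of characters and
    // ? stands for exactly one character.
    public static bool Matches(string pattern, string term)
    {
        string regex = "^" + Regex.Escape(pattern)
            .Replace("\\*", ".*")
            .Replace("\\?", ".") + "$";
        return Regex.IsMatch(term, regex, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Prefix search: the wildcard sits at the end of the term
        Console.WriteLine(Matches("ShareP*", "SharePoint"));   // True
        // Infix search: the wildcard sits in the middle of the term
        Console.WriteLine(Matches("Sha*Point", "SharePoint")); // True
        Console.WriteLine(Matches("Sha*Point", "Sharing"));    // False
    }
}
```

The first pattern is the kind SharePoint seems to handle; the second is the kind we needed and could not get out of the box.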

We went for Lucene.NET, which is the .NET port of the open-source Apache Lucene project.

Lucene.NET is extremely simple to incorporate in your project (it’s just one DLL to add) and it’s very simple to use. We built two components: a crawler that populates the Lucene index, and a web part that consumes the Lucene index and performs searches on it. This post is about the crawler.

The crawler

In order to populate the index we needed to run a crawler/indexer at certain intervals. For this we created a SharePoint timer job. The timer job goes out to preconfigured locations, works its way through the items, and adds the fields we need to search on to the index. Per list item we also added some non-searchable fields so we can find the full list item back at a later stage (site id, web id, list id and unique id of the list item). Have a look at a part of our code. I added comments to clarify things.

public override void Execute(Guid contentDbId)
{
    // Get a reference to the current site collection's content database
    SPWebApplication webApplication = this.Parent as SPWebApplication;
    SPContentDatabase contentDb = webApplication.ContentDatabases[contentDbId];

    // Retrieve some properties we added to the property bag of the
    // webApplication that define which list in which site to index.
    // The properties we set in a receiver class that gets executed
    // the moment we activate the feature. The properties in the
    // feature are read from a property tag in the feature.xml
    // file: <Property Key="IDXSearchIndexPath" Value="c:\\temp\\index"/>
    // The IDXSearchSite property contains the name of the site including
    // the path: e.g. sites/test
    string searchSite = webApplication.Properties["IDXSearchSite"].ToString();
    // The IDXSearchList property contains the name of the list we want
    // to index.
    string searchList = webApplication.Properties["IDXSearchList"].ToString();
    // The IDXSearchIndexPath property contains the full path where we
    // want to store the search index. e.g. c:\\searchindex
    string indexPath = webApplication.Properties["IDXSearchIndexPath"].ToString();
    // We stored the fields as a property in the feature.xml file
    // as a comma separated list, like "field1,field2,field3"
    string[] fields = webApplication.Properties["IDXSearchFields"].ToString().Split(new char[] { ',' });
    // In this situation we assume that the list is located in
    // the root web. The SPSite returned by the Sites indexer has to be
    // disposed, which also cleans up its RootWeb.
    using (SPSite site = contentDb.Sites[searchSite])
    {
        SPList list = site.RootWeb.Lists[searchList];

        // Set up the Lucene objects we need to index the list
        Analyzer analyzer = new StandardAnalyzer();
        // We store the index on a local disk. If you want to just test it
        // you could also use a RAMDirectory object, which lives in memory
        Lucene.Net.Store.Directory directory = FSDirectory.GetDirectory(indexPath, true);
        // Delete the existing index. Whether you want to do this
        // is totally up to your requirements and needs.
        foreach (string f in directory.List())
        {
            directory.DeleteFile(f);
        }
        // Now we create an IndexWriter object that allows us to
        // populate the index with our data
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        try
        {
            iwriter.SetMaxFieldLength(25000);
            string listid = list.ID.ToString();
            string siteid = site.ID.ToString();
            // Loop through all the items
            foreach (SPListItem item in list.Items)
            {
                // Create a new Lucene document
                Document doc = new Document();
                // Add some non-indexed fields to the document that
                // we can use later on to find the correct list item
                doc.Add(new Field("idx_uniqueid", item.UniqueId.ToString(), Field.Store.YES, Field.Index.NO));
                doc.Add(new Field("idx_listid", listid, Field.Store.YES, Field.Index.NO));
                doc.Add(new Field("idx_siteid", siteid, Field.Store.YES, Field.Index.NO));
                // Brute force and ugly way to index all text
                foreach (string field in fields)
                {
                    try
                    {
                        doc.Add(new Field(field, item[field].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
                    }
                    catch (Exception)
                    {
                        // Skip fields that are missing or empty on this item
                    }
                }
                iwriter.AddDocument(doc);
            }
        }
        finally
        {
            // Always release the index, even if indexing fails halfway
            iwriter.Close();
            directory.Close();
        }
    }
}

The code snippet above comes from the timer job class we created, and will be run every time the timer executes. It’s up to you to decide how often this is needed. You can also decide to update the index on an as-needed basis, by adding an event receiver to a list that inserts new items into the index when they are created.
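As a rough sketch of that event receiver option (this is not code from the actual solution): assuming the receiver can read the same IDXSearchIndexPath and IDXSearchFields properties from the web application, and that the timer job has already created an index at that path, an ItemAdded receiver could look something like the following. The class name IndexItemReceiver is hypothetical, and it needs references to the SharePoint and Lucene.NET assemblies to compile.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Administration;

// Hypothetical receiver: adds every newly created list item to the
// existing Lucene index instead of waiting for the next timer job run.
public class IndexItemReceiver : SPItemEventReceiver
{
    public override void ItemAdded(SPItemEventProperties properties)
    {
        SPListItem item = properties.ListItem;
        SPList list = item.ParentList;

        // Assumption: the same property bag values the timer job uses
        SPWebApplication webApplication = list.ParentWeb.Site.WebApplication;
        string indexPath = webApplication.Properties["IDXSearchIndexPath"].ToString();
        string[] fields = webApplication.Properties["IDXSearchFields"].ToString().Split(',');

        // Build the document the same way the timer job does
        Document doc = new Document();
        doc.Add(new Field("idx_uniqueid", item.UniqueId.ToString(), Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("idx_listid", list.ID.ToString(), Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("idx_siteid", list.ParentWeb.Site.ID.ToString(), Field.Store.YES, Field.Index.NO));
        foreach (string field in fields)
        {
            try
            {
                doc.Add(new Field(field, item[field].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
            }
            catch (Exception)
            {
                // Skip fields that are missing or empty on this item
            }
        }

        // create = false: append to the existing index instead of replacing it
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try
        {
            writer.AddDocument(doc);
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Keep in mind that Lucene allows only one writer on an index at a time, so a receiver like this would have to cope with the timer job (or another receiver) holding the write lock at the same moment. Updated and deleted items would need their own handling as well.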

This code example does not index attachments. You can index attachments by making use of IFilter objects, and we do use them in our actual solution. Addressing that goes beyond the scope of this post, but I might go into it in a follow-up post.

Part 2 of this post will be about consuming the index in a web part and actually performing wildcard searches.

SharePoint or not?

If you’ve been working with SharePoint for a long time, you’ll notice a pattern that shows up in many projects:

sales: Sure we can do it in SharePoint!
project manager: Or can we?
developer: Hmmm, I think we need a relational database solution for this.
architect: Do we really? I’m not so sure there.
developer: Yes we do. And SharePoint is not relational, so we can’t do it.
architect: I’m sure we can. We just need to rethink our approach a bit.

Coming (I’ll admit, years ago) from a Domino background, this is not something new to me. These kinds of conversations usually happen when we get a customer request for a solution that steps outside the ‘out of the box’ boundaries of SharePoint, or when we are asked to migrate an existing solution to SharePoint.

When stepping out of the OOTB boundaries, it’s not uncommon to be working from a specification written by the customer, or even from some screenshots or demo builds. And it’s also not uncommon that the customer comes from a relational database background. Users have come to depend on the fact that you can build relationships within data, and that’s how they think nowadays. Maybe they even built their own MS Access databases to prove their approach, and that’s what we are presented with.

Then the specialist comes in and tells them that SharePoint follows a non-relational approach: that everything we store is like a file in a folder (read: an item in a list), and that although you can refer from one item to another, when you remove one of the items the reference no longer resolves, and SharePoint doesn’t even notify us of this (note: this is taken care of in SharePoint 2010, which does allow relationships between items in a list and also maintains/enforces them).

There are two ways to solve the customer request: either you go for a relational database solution or you rethink your approach. Do we really need to maintain relationships in our data? Well, sometimes you really do, but it’s also not uncommon that you really don’t. In some solutions it’s perfectly acceptable to copy the little snippets of data that you need to ‘refer’ to into the item that you are storing. Only if you need to be able to dynamically update those copied snippets at a later stage does it become a bigger challenge in SharePoint.

And that’s where the rethinking comes in. You have to look at the base problem, not at the solution the customer handed you. Do that, and SharePoint is a very viable solution in many cases. Even if you decide to go for a relational database solution, SharePoint still adds value: easy deployment of solutions, access management that can be done by end users instead of administrators, scaling a solution by adding servers to the farm, etc.

Note: SharePoint 2010 solves many of these challenges and also allows for solutions where you don’t have to rethink the approach at all. More on that in upcoming posts.
