
Better Search Functionality with Lucene.NET.

I've been on a Big Data and machine learning ... learning spree lately and ran across Apache Solr. Solr is a complete application for powering search functionality. It ships with data importers for many file types, such as XML and CSV, and even connectors to database sources.

Under the hood, Solr runs on Lucene, which is an Apache top-level project. You can use Lucene by itself if you need custom functionality and low-level access. The most important thing is that it's fast, blindingly so. Fortunately, getting started with it is very easy, and that's what I'm going to dive into today.

Simple Ways of Search

Before using Lucene, how would I have thought about doing simple searches for some of my data sets?

  • In .NET, probably call .Contains(searchTerm) on the relevant fields.
  • Possibly do SQL stuff, e.g. like '%searchTerm%' or even a series of:
...
     OR {var1} like '%word1%'
     OR {var2} like '%word2%'
...

Then I would have to consider things like dealing with stop words (common words such as "the" and "of" that add noise to a search), and so on and so forth.
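To make the naive approach concrete, here is a minimal sketch of a Contains-style search with a hand-rolled stop-word filter. The NaiveSearch class and its tiny stop-word list are made up for illustration; the point is only that every title gets scanned on every query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveSearch
{
    // A tiny, illustrative stop-word list; real lists are much longer.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "a", "of" };

    public static List<string> Search(IEnumerable<string> titles, string searchTerm)
    {
        // Split the query into words and drop the stop words.
        var words = searchTerm
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !StopWords.Contains(w))
            .ToList();

        // Every title is checked against every remaining word,
        // so the work grows linearly with the number of titles.
        return titles
            .Where(t => words.Any(w => t.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

It works for a handful of records, but tokenizing, ranking and stop-word handling all end up being code you have to write and maintain yourself.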

Lucene To The Rescue

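Lucene has a bunch of neat features, including handling stop words, whitespace and all sorts of other things. It is also fast: instead of scanning every record for each query (an O(n) search), it looks terms up in a prebuilt inverted index, so the cost of a query grows far more slowly than the size of the data. There is some basic information about Lucene that we can cover first: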

  1. Lucene can store its index in a database, but according to the documentation this is slower.
  2. It builds an index as binary files in a folder.
  3. The Document type is a top-level type in the library. Building an index means adding Documents, and searching it means retrieving them.
  4. The Field type is also a top-level type. A Document has many Fields in it. These usually correspond to what we'll be searching and letting Lucene run its analyzers and optimizers on.
  5. There are several Analyzer types, each a different way of tokenizing text into words. I'll be using the StandardAnalyzer today.
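To get a feel for why an index beats a scan, here is a toy inverted index in plain C#. It is only a sketch of the general idea and not how Lucene is implemented: the TinyInvertedIndex class, its crude tokenizer and the lack of scoring are all simplifications of mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TinyInvertedIndex
{
    // Maps each token to the ids of the documents that contain it.
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

    public void Add(int id, string text)
    {
        foreach (var token in Tokenize(text))
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(token, out ids))
            {
                ids = new HashSet<int>();
                _postings[token] = ids;
            }
            ids.Add(id);
        }
    }

    // A lookup touches only the entry for the term, not every document.
    public List<int> Lookup(string term)
    {
        HashSet<int> ids;
        if (!_postings.TryGetValue(term, out ids))
        {
            return new List<int>();
        }
        return ids.OrderBy(i => i).ToList();
    }

    // Crude tokenizer: split on spaces and a few punctuation marks.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.Split(new[] { ' ', ':', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The tokenizing work is paid once, when a document is added. An Analyzer plays that role in Lucene, with far more care about casing, punctuation and stop words.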

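You can import Lucene.NET into your project with the command below. The code in this post is written against the 3.0 API (note the LUCENE_30 version constant); later major versions changed parts of that API, so add -Version 3.0.3 to the command if you want to follow along exactly.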
Install-Package Lucene.Net

Standard Lucene Service Class

I'm going to make this class as generic as possible so that I can reuse it. Each instance takes the path of the folder where the application will write the index. Make sure that the application has permission to write to that folder.

public class LuceneIndexer  
   {
       private Analyzer _analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
       private string _indexPath;
       private Directory _indexDirectory;
       private IndexWriter _indexWriter;       

       public LuceneIndexer(string indexPath) 
       {
           this._indexPath = indexPath;
           _indexDirectory = new MMapDirectory(new System.IO.DirectoryInfo(_indexPath));            
       }
   }

Pretty straightforward. Note that there are different Directory types. I chose MMapDirectory because the documentation indicates that it has better concurrency performance.

Building the Index

This snippet will build the entire index. Note that as you add more entities or searchable items to the system, you should not rebuild the complete index each time. Instead, write an update function that replaces the old item with the new one.

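public void BuildCompleteIndex(IEnumerable<Document> documents)
{
    // UNLIMITED means no field is truncated while indexing.
    _indexWriter = new IndexWriter(_indexDirectory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    try
    {
        foreach (var doc in documents)
        {
            _indexWriter.AddDocument(doc);
        }
        // Merge index segments; worthwhile after a full rebuild.
        _indexWriter.Optimize();
    }
    finally
    {
        // Dispose commits pending changes and releases the write lock,
        // so separate Flush and Close calls are not needed.
        _indexWriter.Dispose();
    }
}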

To build the index, we have to pass it a collection of Document objects. I kept document creation out of this class because I want whatever is using it to build its own custom documents. The field names in a document don't have to match your domain model at all. In fact, I would even create a custom model for it, much like an MVC view model. An example of building a collection of Documents would be:
MovieIndexer.cs

...
var indexer = new LuceneIndexer(@"C:\Temp\Lucene");  
indexer.ClearIndex();

var list = new List<Movie>();  
list.Add(new Movie{id=1,content="The Simpsons"});  
list.Add(new Movie{id=2,content="Simpsons the Movie"});  
list.Add(new Movie{id=3,content="The Little Rascals"});  
list.Add(new Movie{id=4,content="Terminator: Salvation"});  
list.Add(new Movie{id=5,content="Terminator 3: Rise of the Machines"});  
list.Add(new Movie{id=6,content="The Terminator"});

var documents = new List<Document>();

foreach (var item in list)  
{
    var doc = new Document();
    doc.Add(new Field("id", item.id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("description", item.content, Field.Store.YES, Field.Index.ANALYZED));                
    documents.Add(doc);
}

indexer.BuildCompleteIndex(documents);  
...

This is an in-memory collection of movies. Notice that the description field is analyzed and the id field is not. Both fields are stored (Field.Store.YES), so both come back with the search results. Having an id is pretty important: once we have it, we can use it to fetch more information about the Movie.
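The id field is also what makes the update function mentioned earlier possible. Here is a minimal sketch of one way to write it, assuming every document carries a unique, non-analyzed "id" field as in the example above. The LuceneIndexUpdater class is a hypothetical helper of mine, not part of the service class; IndexWriter.UpdateDocument deletes any document matching the given term and then adds the new one.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class LuceneIndexUpdater
{
    // Replaces (or adds) each document, keyed on its "id" field.
    // Returns the number of documents written.
    public static int UpdateIndex(Directory indexDirectory, Analyzer analyzer, IEnumerable<Document> documents)
    {
        var written = 0;
        using (var writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var doc in documents)
            {
                // Assumption: "id" is stored and NOT_ANALYZED, so the exact value is a term.
                var idTerm = new Term("id", doc.Get("id"));
                writer.UpdateDocument(idTerm, doc);
                written++;
            }
        } // Disposing the writer commits the changes.
        return written;
    }
}
```

Because the lookup uses an exact term, this only works if the id is indexed without analysis. An analyzed id could be split or lowercased and never match.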

Single Field Search

Alright so now how do we search for the movie?

public IEnumerable<Document> Search(string searchTerm, string searchField, int limit)
{
    // The searcher holds index files open, so dispose it when done.
    using (var searcher = new IndexSearcher(_indexDirectory))
    {
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, searchField, _analyzer);
        var query = parser.Parse(searchTerm);
        // ScoreDocs holds up to `limit` hits, best score first.
        var hits = searcher.Search(query, limit).ScoreDocs;

        var documents = new List<Document>();
        foreach (var hit in hits)
        {
            // hit.Doc is Lucene's internal document number.
            documents.Add(searcher.Doc(hit.Doc));
        }

        // The shared _analyzer stays open so this instance can be reused.
        return documents;
    }
}

Huh? What's a ScoreDoc? It pairs a document's internal number (Doc) with the relevance score that Lucene assigned to it based on its algorithms. Here's a link if you need to find out more: Lucene - Scoring.

We loop through each ScoreDoc (a single hit) and use the searcher to pull the actual document out.
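One thing to watch for when the search term comes straight from a user: QueryParser treats characters such as : and * as query syntax, so a title like "Terminator: Salvation" may be read as a fielded query rather than as plain text. Here is a minimal sketch of one way to handle that, using the static QueryParser.Escape method. The SafeQueryBuilder helper and its empty-input behaviour are my own assumptions, not part of the service class.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using LuceneVersion = Lucene.Net.Util.Version;

public static class SafeQueryBuilder
{
    // Builds a query that treats the user's input as literal text.
    public static Query BuildLiteralQuery(string userInput, string field, Analyzer analyzer)
    {
        if (string.IsNullOrWhiteSpace(userInput))
        {
            // An empty BooleanQuery matches nothing.
            return new BooleanQuery();
        }

        var parser = new QueryParser(LuceneVersion.LUCENE_30, field, analyzer);
        // Escape backslash-escapes the characters that are part of the query syntax.
        return parser.Parse(QueryParser.Escape(userInput));
    }
}
```

The trade-off is that escaped input can no longer use operators such as AND, wildcards or field prefixes, so this suits a plain search box better than a power-user query field.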

And just as we converted items to Documents on the way in, we need to convert them back to Movies wherever you're calling it from. In my case that's a Web API 2 controller:

var results = new List<Movie>();

var query_results=indexer.Search(term, "description", 100);

foreach(var movie in query_results)  
{
    results.Add(new Movie { id = Convert.ToInt32(movie.Get("id")), content = movie.Get("description") });
}

Doing A Search Web API

I won't go through the details of creating the Web API controller but here are the results when doing a RESTful call:

[
 {"id":6,"content":"The Terminator"},
 {"id":4,"content":"Terminator: Salvation"},
 {"id":5,"content":"Terminator 3: Rise of the Machines"}
]

And when I put it in the browser:
[Screenshot: Lucene Single Term Search Result]

Here's the entire LuceneIndexer.cs file:

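using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public class LuceneIndexer
{
    private Analyzer _analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
    private string _indexPath;
    private Directory _indexDirectory;
    private IndexWriter _indexWriter;

    public LuceneIndexer(string indexPath)
    {
        this._indexPath = indexPath;
        _indexDirectory = new MMapDirectory(new System.IO.DirectoryInfo(_indexPath));
    }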

    public void BuildCompleteIndex(IEnumerable<Document> documents)
    {
        _indexWriter = new IndexWriter(_indexDirectory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            foreach (var doc in documents)
            {
                _indexWriter.AddDocument(doc);
            }
            // Merge index segments; worthwhile after a full rebuild.
            _indexWriter.Optimize();
        }
        finally
        {
            // Dispose commits pending changes and releases the write lock.
            _indexWriter.Dispose();
        }
    }

    public int UpdateIndex(IEnumerable<Document> documents)
    {
        throw new NotImplementedException();
    }

    public void ClearIndex()
    {
        _indexWriter = new IndexWriter(_indexDirectory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            _indexWriter.DeleteAll();
        }
        finally
        {
            // Dispose commits the deletion and releases the write lock.
            _indexWriter.Dispose();
        }
    }


    // Single field search
    public IEnumerable<Document> Search(string searchTerm, string searchField, int limit)
    {
        // The searcher holds index files open, so dispose it when done.
        using (var searcher = new IndexSearcher(_indexDirectory))
        {
            var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, searchField, _analyzer);
            var query = parser.Parse(searchTerm);
            var hits = searcher.Search(query, limit).ScoreDocs;

            var documents = new List<Document>();
            foreach (var hit in hits)
            {
                documents.Add(searcher.Doc(hit.Doc));
            }

            // The shared _analyzer stays open so this instance can be reused.
            return documents;
        }
    }

    // Allows multiple field searches
    public IEnumerable<Document> Search(string searchTerm, string[] searchFields, int limit)
    {
        using (var searcher = new IndexSearcher(_indexDirectory))
        {
            // Each term in the query is matched against every field in searchFields.
            var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, _analyzer);
            var query = parser.Parse(searchTerm);
            var hits = searcher.Search(query, limit).ScoreDocs;

            var documents = new List<Document>();
            foreach (var hit in hits)
            {
                documents.Add(searcher.Doc(hit.Doc));
            }

            return documents;
        }
    }

}
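The multi-field overload isn't exercised anywhere above, so here is a small self-contained sketch of the same MultiFieldQueryParser idea against an in-memory RAMDirectory. The "summary" field and its text are invented purely for this demo; they are not part of the Movie example.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

public static class MultiFieldSearchDemo
{
    // Indexes two sample movies in memory and returns the ids matching the term
    // in either the "description" or the (hypothetical) "summary" field.
    public static List<string> FindIds(string term)
    {
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);
        var ids = new List<string>();

        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                writer.AddDocument(MakeDoc("6", "The Terminator", "A cyborg travels back in time"));
                writer.AddDocument(MakeDoc("3", "The Little Rascals", "Kids getting into mischief"));
            } // Disposing the writer commits the documents.

            using (var searcher = new IndexSearcher(directory, true))
            {
                var fields = new[] { "description", "summary" };
                var parser = new MultiFieldQueryParser(LuceneVersion.LUCENE_30, fields, analyzer);
                var query = parser.Parse(term);

                foreach (var hit in searcher.Search(query, 10).ScoreDocs)
                {
                    ids.Add(searcher.Doc(hit.Doc).Get("id"));
                }
            }
        }
        return ids;
    }

    private static Document MakeDoc(string id, string description, string summary)
    {
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("description", description, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("summary", summary, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}
```

With the service class, the equivalent call would be indexer.Search(term, new[] { "description", "summary" }, 100), provided the documents were indexed with both fields.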

I hope that helps you on your way to building better searches.
