Friday Repost: Indexing documents with Tika and Grails

The Friday Repost series are copies from my earlier writings. Since I don’t want to loose them, and they might prove useful to others, I’m reposting them on this blog

In one of the previous blogposts, I described a way to easily upload images by using Grails commands. This blogpost further builds on that, by providing a generic way to index uploaded documents. These documents can later be retrieved by using the search option provided by the Grails Search plugin.

To index documents, we’re going to need two components: one of these components is the Grails Search plugin, as described above. The other component is Tika. Tika is a document indexer library which can index many kinds of documents like Text, Word, Excel, Powerpoint, but also more exotic document formats, like MP3 and FLV. Combined with Lucene, this provides a powerful combo.

So, how do we get this to work? Well, first we need the application from the previous blogpost. I refactored it a little to represent the new generic format better. I renamed all references from image to document (so now we have a CreateDocumentCommand instead of a CreateImageCommand, as well as all other references), and I moved the upload.gsp to the views directory, since redirection and forwarding works better that way. After the refactoring was done, I first installed the searchable plugin by typing:

grails install searchable

I also needed Tika, so I added it to the BuildConfig.groovy in the grails-app/conf directory, like this:

Grails and extra XML libraries really don’t mix well, so to prevent some nasty class loading issues, better exclude them here. What we need to do now is to tell searchable to index our documents when we upload them. This is quite trivial really, we can just annotate the Document class (which in the previous application was called ‘Image’) with a Tika converter. Then we have to create the converter, register it in Spring, and we should be done. So, first things first: let’s annotate the Document class.

In the Document class, add a searchable closure with a converter for the byte array, like this:

We now need to create the TikaConverter. We’ll create a TikaConverter.groovy in the src/groovy/com/example/converter directory. The class should look like this:

This class is responsible for extracting text from any kind of document uploaded to the system. To make this converter available, we have to register it in the Spring context, and then we’re ready to go. To register the Converter in the Spring context, add the following lines to grails-app/conf/spring/resources.groovy:

Now everything should be set in place. Browse to: http://localhost:8080/document-indexing/upload/show to upload a document. Any kind of document will do, as long as it contains some text. When the uploading went alright, the document will be available in the Lucene index. You can test this by going to http://localhost:8080/document-indexing/searchable and type in some text from the document you have just uploaded. The result should be a result page similar to Google.