2015 Jul 01

Search seems to always emerge as one of those critical site requirements that can be really tough to deliver on. There are two main points of concern when implementing search: standard content searching, and searching within file attachments like PDFs and Word documents.

In the past, there have been the usual suspects available to handle either of these content types. Solutions such as SearchBlox (Lucene under the hood), full-text database searching using MySQL/MSSQL, Apache Solr, and various CLI tools like pdftotext or catdoc are all commonly used players in this space. Having used all of these at some point, we found that none of them was both feature-rich and easy to implement, which left us consistently searching for a better solution.

Apache Solr comes close to being a well-rounded search option and has been used with Drupal for a while, but it's slowly being eclipsed by other emerging solutions.

Recently we had the opportunity to evaluate search tools for a new client site, the Canadian Agency for Drugs and Technologies in Health (cadth.ca). They required a robust search solution as it would be the main focus of the new site: with thousands of content pages and PDF reports, the goal was to get users to the content as quickly and easily as possible.

Through our research, a clear winner emerged in the search index landscape: Elasticsearch. It seemed to hit all of our requirements and, while not used extensively within Drupal in the past, appeared to be the best tool for the job. Top features of Elasticsearch include: a RESTful API (easier than SOAP), JSON payloads (better than using XML for sure), and clear documentation. Lastly, and perhaps most importantly, it is blazing fast: it's the first thing you really notice about it compared to other search tools.

Getting Started with Elasticsearch

Getting started was slow at first but picked up quickly once we dove deeper. Below are the Elasticsearch Quick Start steps for an up-and-running setup.

[Image: Elasticsearch quick setup]

After getting used to it on the CLI, we wanted to create a basic proof-of-concept search to make sure it had all the features we needed. This is always a good start whenever you're trying to learn or test something out. As an aside for development in general, it's really important to isolate new or untested functionality from the main project code when researching or testing. A lot of the time, you'll see solutions being built in place without any prior siloed example of something working on its own.

Here's a basic high-level look at my test search:

[Screenshot: test search page]

We wanted to confirm the following, with an example of each:

  •  pagination 
  •  a way to pull indexed fields
  •  search by a specific exact value (say a taxonomy term = 107856)
  •  keyword search with nGrams for partial matching
  •  a way to sort on field values 

Once those things were confirmed, we had peace of mind knowing that the basic code used in that test search could be ported to Drupal as the base for the new search interface.
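
As a rough illustration of the exact-value item on that list, the sketch below builds a request body that keeps only documents tagged with one taxonomy term, using the `filtered` query and `term` filter from the Elasticsearch 1.x query DSL. The helper name `build_term_filter_body` and the field name `field_topic` are hypothetical; only the term ID 107856 comes from the list above.

```php
<?php
/**
 * Build a search body that matches documents whose field holds an exact value.
 * Hypothetical helper for illustration; assumes the Elasticsearch 1.x query DSL.
 */
function build_term_filter_body($field, $value)
{
    return array(
        'query' => array(
            'filtered' => array(
                // No keyword scoring is needed here, so match everything...
                'query' => array('match_all' => new \stdClass()),
                // ...and let the term filter keep only the exact matches.
                'filter' => array(
                    'term' => array($field => $value),
                ),
            ),
        ),
    );
}

// "field_topic" is an assumed field name; 107856 is the example term ID.
$body = build_term_filter_body('field_topic', 107856);
echo json_encode($body), "\n";
```

The resulting array would be passed as the `body` key of the search parameters. A term filter compares against the indexed value without analysis, which is why it suits IDs better than a keyword query does.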

The Code Perspective

Here's a quick high-level look at how this works from a code perspective. The search parameters are formed using PHP arrays. This is super easy, and it's the way you always build queries with the PHP client, but it can become cumbersome for larger queries: you end up getting lost in array land. Below is the array built to run a keyword search on the "docs" index, with the search term (for example "test") taken from the q parameter of the request.

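<?php
// Composer autoloader for the official Elasticsearch PHP client.
require 'vendor/autoload.php';
$client = new Elasticsearch\Client();

// Read and sanitize the request parameters.
$keywords = isset($_GET['q']) ? trim($_GET['q']) : '';
$cur_page = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;

// "sort" arrives as "<field>_<order>", e.g. "created_desc". Split on the last
// underscore so that field names containing underscores still work.
$sort_field = '';
$sort_order = '';
$sort = isset($_GET['sort']) ? $_GET['sort'] : '';
$pos = strrpos($sort, '_');
if ($pos !== false) {
    $sort_field = substr($sort, 0, $pos);
    $sort_order = strtolower(substr($sort, $pos + 1));
}

$searchParams['index'] = 'docs';
$searchParams['type'] = 'docs';

// Pagination: "from" is the zero-based offset of the first hit on the page.
$amount_per_page = 10;
$searchParams['size'] = $amount_per_page;
$searchParams['from'] = ($cur_page - 1) * $amount_per_page;

if ($keywords !== '') {
    // Keyword search across the title and the catch-all "_all" field.
    $searchParams['body']['query']['filtered']['query']['multi_match']['fields'] = array('title', '_all');
    $searchParams['body']['query']['filtered']['query']['multi_match']['query'] = $keywords;
} else {
    // An empty PHP array is encoded as JSON [], so use an object to get {}.
    $searchParams['body']['query']['match_all'] = new \stdClass();
}

// Only sort when both a field and a valid direction were supplied.
if ($sort_field !== '' && in_array($sort_order, array('asc', 'desc'), true)) {
    $searchParams['body']['sort'][$sort_field] = array('order' => $sort_order);
}
$result = $client->search($searchParams);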

Next up is the result. I've only shown a snippet here as each record is quite verbose. It found 1998 results for "test" out of the 10,000+ records. You can also see the document field data stored as Base64-encoded text.

Array
(
    [took] => 3
    [timed_out] => 
    [_shards] => Array
        (
            [total] => 1
            [successful] => 1
            [failed] => 0
        )

    [hits] => Array
        (
            [total] => 1998
            [max_score] => 6.162046
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_index] => docs
                            [_type] => docs
                            [_id] => 81249
                            [_score] => 6.162046
                            [_source] => Array
                                (
                                    [id] => 81249
                                    [attachments_field_document] => CkNBTkFESUFOIENPT1JESU5BVElORwpPRkZJQ0........
                                    [author] => 0
                                    [body:summary] => 
                                    [body:value] => 
                                    [created] => 1041397200
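
Reading fields back out of a response shaped like the one above is plain array access. The sketch below is one way to do it: `extract_hits` is a hypothetical helper, and the sample `$result` is a hand-trimmed stand-in that reuses a few values from the snippet above rather than a real response.

```php
<?php
/**
 * Collect the requested _source fields from each hit in a search response.
 * Hypothetical helper; fields missing from a hit come back as null.
 */
function extract_hits(array $result, array $fields)
{
    $rows = array();
    $hits = isset($result['hits']['hits']) ? $result['hits']['hits'] : array();
    foreach ($hits as $hit) {
        $row = array('_id' => $hit['_id'], '_score' => $hit['_score']);
        foreach ($fields as $field) {
            $row[$field] = isset($hit['_source'][$field]) ? $hit['_source'][$field] : null;
        }
        $rows[] = $row;
    }
    return $rows;
}

// Trimmed stand-in for a real response (values borrowed from the snippet above).
$result = array(
    'hits' => array(
        'total' => 1998,
        'hits' => array(
            array(
                '_id' => '81249',
                '_score' => 6.162046,
                '_source' => array('id' => 81249, 'created' => 1041397200),
            ),
        ),
    ),
);

$rows = extract_hits($result, array('id', 'created', 'title'));
printf("%d of %d hits on this page\n", count($rows), $result['hits']['total']);
```

If the Base64 attachment field makes responses too heavy, the `_source` option in the request body can restrict which fields Elasticsearch sends back in the first place.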

Connecting Elasticsearch and Drupal

The next phase was learning how data could be ferried over to Elasticsearch from Drupal. We had previously used the Search API module, which has many different backends available. We had heard mention of Elasticsearch being one of them and started investigating the options.

The two modules available are "Elasticsearch" and "Elasticsearch Connector". We ended up choosing the connector module because:

  • It uses the official Elasticsearch PHP library
  • It was better supported at the time and had more downloads
  • It has a pending Drupal 8 release in the works
  • It provided more of an ecosystem for Elasticsearch within Drupal rather than a simple connection for search as explained here.

The final piece for indexing data was the document files themselves. We had presumed at first that Elasticsearch had this built in. It turns out it's more of an afterthought than a main feature: you need to install a plugin called "Mapper Attachments". It uses the Apache Tika library to parse documents that are submitted as Base64-encoded content. By default, 100,000 characters are extracted from each document when indexing. We've left that value as-is, since it is more than enough text to support searching.
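
To make the Base64 requirement concrete, here is a minimal sketch of a mapping fragment and a document body for the Mapper Attachments plugin. The `attachment` field type is what the plugin provides; the helper name, the sample text, and the `id` field are assumptions for illustration, and the field name is borrowed from the sample result earlier in this post.

```php
<?php
// Mapping fragment: the "attachment" type is provided by Mapper Attachments.
$mapping = array(
    'properties' => array(
        'attachments_field_document' => array('type' => 'attachment'),
    ),
);

/**
 * Build a document body with the raw file bytes Base64-encoded.
 * Hypothetical helper; the plugin decodes the content and parses it with Tika.
 */
function build_attachment_document($id, $fileContents)
{
    return array(
        'id' => $id,
        'attachments_field_document' => base64_encode($fileContents),
    );
}

// In practice the bytes would come from file_get_contents() on a PDF or Word
// file; a short string stands in for the file here.
$doc = build_attachment_document(1, 'Hello, Tika');
echo $doc['attachments_field_document'], "\n";
```

The mapping would be applied when the index is created and the document body sent through the client's index call; both steps are left out here so the sketch runs without a server.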

A list of modules and other dependencies for the search setup:

  • Elasticsearch Connector. Chosen for its use of the official PHP library provided by Elasticsearch. To tie in with the Search API module, we also turned on the "Elasticsearch Connector Search API" sub-module.
  • Apache Tika jar file. This is what the Search API Attachments module uses for indexing files.
  • Search API. Provides the backend interface to control the index.
  • Search API Attachments. This module is an add-on to the Search API module which allows the indexing and searching of attachments. We are also using a submodule of this called "Search API Attachments Field Collections".
  • Search API base64 encoded Attachments. This is a custom module that hooks into Search API as a data alteration callback. It adds Base64 encoding to files being indexed so they work with the Elasticsearch Mapper Attachments plugin. By default, Search API Attachments stores the files as-is with no encoding. Elasticsearch wants the data coming in as Base64, so we need this module to intercept the data and encode it before it gets to Elasticsearch.
  • Search API Field Analyzer. A custom module that adds analyzers to the fields that Elasticsearch needs for nGram-type searches. It does this when the index is created. It would be nice to see Search API include this functionality, but for now, to get any advanced indexing options, we needed to put them in a module that runs at index creation time. This becomes problematic when moving the site: Drupal thinks content is already indexed, while the new server doesn't yet hold an index. Deleting the index from within Drupal re-triggers the hooks in the modules that add those analyzers, but it also means all the field mappings and settings have to be redone, which is far from ideal when migrating a site. There might be a better way to handle this, but at the time it was the best option we found.
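
For a sense of what a module like Search API Field Analyzer has to inject at index creation, the sketch below shows index settings with an nGram token filter, plus a small pure-PHP function that mimics which partial tokens such a filter would produce. The analyzer and filter names, the gram sizes, and the `ngrams()` helper are all assumptions for illustration, not the module's actual code.

```php
<?php
// Index settings with a custom analyzer (assumed names and gram sizes).
$settings = array(
    'analysis' => array(
        'filter' => array(
            'partial_filter' => array(
                'type' => 'nGram',
                'min_gram' => 3,
                'max_gram' => 4,
            ),
        ),
        'analyzer' => array(
            'partial_analyzer' => array(
                'type' => 'custom',
                'tokenizer' => 'standard',
                'filter' => array('lowercase', 'partial_filter'),
            ),
        ),
    ),
);

/**
 * Illustrative only: list the lowercase substrings of $word whose length is
 * between $min and $max, shortest first. Elasticsearch may emit the same
 * grams in a different order.
 */
function ngrams($word, $min, $max)
{
    $word = strtolower($word);
    $grams = array();
    for ($len = $min; $len <= $max; $len++) {
        for ($i = 0; $i + $len <= strlen($word); $i++) {
            $grams[] = substr($word, $i, $len);
        }
    }
    return array_values(array_unique($grams));
}

echo implode(', ', ngrams('Drugs', 3, 4)), "\n";
```

With grams like these stored in the index, a partial keyword such as "rug" can match a document containing "Drugs", which is the partial-matching behaviour the analyzers are there to provide.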

The Final Result

Here's a look at the final product. Notice the similarities to the test search screenshot above?

[Screenshot: final example of the Elasticsearch search page]

It ended up having more layers to it, such as being AJAX-driven and having the ability to save searches, but on the whole it's a pretty straightforward layout. The front-end uses basic forms that are serialized whenever the user changes them; that data is then passed to a file that handles the GET requests to Elasticsearch.

In the end, Drupal handles the passing of data to Elasticsearch and the front-end basically provides the templating needed to display the custom forms and search results like it would with any other type of page. This is an ideal setup: if things need to be tweaked at a low level (and they always do), they can be without needing much, if any, Drupal-related development work. It's a similar mindset to adding a Twitter feed to your site: the data comes from an external service and the site only handles the display.

In conclusion, Elasticsearch is a great tool if you need search, period. There are a host of monitoring and administrative tools that Elastic.co makes available as well. Like Solr, Elasticsearch is built on top of Apache Lucene, so you're still getting the familiar engine underneath, but with features that make it much more pleasant to interact with.