2017 Feb 14

Typically with Elasticsearch or Solr, the indexed data is treated as ephemeral. In other words, the indexes stored within are not permanent; they exist for their speed and searchability, with the understanding that they will be rebuilt from time to time. When using these technologies, your data's system of record usually lives in a proper database like PostgreSQL or MySQL. An example is a CMS such as Drupal, which uses a MySQL database: all the content is created and maintained within that system and ultimately stored in its underlying database, which is permanent. To provide a better search experience, we shuttle that data over to a service like Elasticsearch and build a search interface against it to get the speed and search features we need from our data. This is exactly the setup we typically use when building an advanced search as part of a website.

Tic-toc goes the docs

Normally Elasticsearch runs without needing much intervention, but sometimes the index becomes compromised and needs to be rebuilt. Usually this operation is fairly quick and the site can carry on like normal. The problem is when you have a large amount of data that takes time to build into the Elasticsearch index; if the index goes down, it's generally unacceptable for search to be unresponsive during that time. One such setup we have is a Drupal site using Elasticsearch, where we ferry over thousands of pages worth of content. In addition, PDF documents get indexed as part of each page. The underlying process uses Apache Tika for that: for each document, the Tika jar is invoked to pull out the contents of the document. The page indexing itself can take some time, but piling on document indexing really slows things down, so much so that it generally takes hours to fully build up the Elasticsearch index to mirror the content stored in Drupal. It's not something that has to happen often, but when it does, it would be nice to have a version of the existing index in place for the site to use while we rebuild it in the background from Drupal. That way the search still works and it isn't obvious to users that a rebuild operation is happening in the background.

Snapshot and Restore

Enter the built-in Elasticsearch feature called "snapshot and restore". Basically, we wanted a way to bring site search back up quickly in case of index failure. This either meant maintaining parallel indexes or some sort of backup strategy so we could restore an index from, say, last night or early this morning. After some research, we were pleased to find that Elastic had a solution for the latter built in. There are also third-party snapshot/backup tools and scripts available, but we decided that using the built-in functionality directly would probably be safer. Examples we found were elasticsearch-exporter, elasticdump and various Python scripts, all of which appear to use the underlying snapshot and restore functionality anyway. Rather than be tied to any of those, the approach we took was to build some shell scripts around curl commands against the REST endpoints.

REST in peace

Since Elasticsearch is RESTful by nature, these snapshot commands are just additional endpoints that can be used to back up and restore indexes. There is some setup required to define the backup location, though, so we'll start there.

Define the backup repository

More information about setting up the repository can be found in the Elasticsearch docs, but the basic elements are:

  1. Define the backup repository location in /etc/elasticsearch/elasticsearch.yml. There's already a paths section in there, so it's a good idea to put the new parameter alongside it. It's defined as an absolute path, and you can define multiple locations using an array-like syntax. It will end up looking like this: path.repo: ["/vagrant", "/vagrant/src"]
  2. Then you'll need to create a new repository to back up to. Here the repo is being called "docs_backup", but it can be any name you choose (a quick way to verify it afterwards is shown just after this list):
    curl --silent -XPUT localhost:9200/_snapshot/docs_backup -d '
    {
        "type": "fs",
        "settings": {
            "location": "/vagrant/esbackup"
        }
    }'
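
To confirm the repository was registered, the snapshot endpoint can be queried directly. This is just a quick sanity check using the same "docs_backup" name from step 2:

# Show the settings for the "docs_backup" repository; the response should
# echo back the "fs" type and the configured location.
curl --silent -XGET localhost:9200/_snapshot/docs_backup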

Creating the backup

Now that there's a repository set up for backups, we can run a backup. This assumes you already have an existing index to back up. The example given is for an index called "docs", and the snapshot itself is named "snapshot" in this case. Again, you could call this anything:

curl --silent -XPUT localhost:9200/_snapshot/docs_backup/snapshot -d '
{
    "indices": "docs"
}'

Note: the --silent flag for curl is being used here to get cleaner output from the command; it suppresses the progress meter (and also error messages).
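
Snapshot creation kicks off in the background and the PUT returns right away. If you want to check on progress, or make the call block until the snapshot is finished, something along these lines should work (same repo and snapshot names as above; wait_for_completion is a standard query parameter on the snapshot API):

# Check the current state of the snapshot (e.g. IN_PROGRESS or SUCCESS)
curl --silent -XGET localhost:9200/_snapshot/docs_backup/snapshot

# Or make the snapshot call itself wait until it's done
curl --silent -XPUT 'localhost:9200/_snapshot/docs_backup/snapshot?wait_for_completion=true' -d '
{
    "indices": "docs"
}'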

What this is doing is creating a physical representation of the index on the server's filesystem. The index directory has the snapshot file containing the JSON representation of the index, and the folders contain Lucene binary files representing the data.

[Screenshot: the root of the backup folder. The snapshot-snapshot file contains the JSON representation of the index.]

[Screenshot: the backup index directory, showing the files representing the exported data.]

Restoring

We have two options for restoring. The most basic one is to simply replace the current index with a backup. The second is to make a copy on restore under a different name. The nice thing about the second option is that a new index is created from the restored data, which allows fixing the current broken one while the site uses the newly restored index.

  1. For the first case, you'll need to close the running index first with curl -XPOST localhost:9200/docs/_close and then run the restore (a minimal example follows this list). The index will re-open upon completion.
  2. For the second variant (making a copy on restore), we just need to add a couple of extra params: "rename_pattern": "docs" and "rename_replacement": "restored_docs".
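
For the first variant, a minimal sequence looks something like this, using the same "docs_backup" repo and "snapshot" snapshot as before:

# Close the index so the restore can overwrite it in place
curl -XPOST localhost:9200/docs/_close

# Restore the "docs" index from the snapshot; it re-opens when done
curl -XPOST localhost:9200/_snapshot/docs_backup/snapshot/_restore -d '
{
    "indices": "docs"
}'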

Here's what it looks like in full using the rename method. docs_backup is the name of the repo and "snapshot" the name of the snapshot we're going to use.

curl -XPOST localhost:9200/_snapshot/docs_backup/snapshot/_restore -d '
{
    "indices": "docs",
    "rename_pattern": "docs",
    "rename_replacement": "restored_docs"
}'
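
Once the restore completes, the _cat API is a quick way to confirm the new index exists; "restored_docs" should show up alongside "docs":

# List all indexes with health, document counts and size on disk
curl --silent -XGET 'localhost:9200/_cat/indices?v'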

Conclusion

Given this setup, we now have a way to quickly restore a version of the index if it goes down. Then, after getting it back up, we can go in and take the 4 hours to rebuild the index that's hooked up to the CMS. Ultimately the backup should be a script that runs on a schedule each day, and that's how we've set it up, keeping only one master snapshot. Elasticsearch does support incremental snapshots, just like a computer backup tool might, but that's generally not needed if all you want is a good, recent working copy to use while you fix the real index. The rename-on-restore option is what really helps here: by keeping a main index called "docs", a restore can be done into a new index called "restored_docs". The site search config on the front end can be quickly changed to use that index while "docs" is being rebuilt, and then switched back to "docs" once it's in good shape again.
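
As a rough sketch of what that scheduled backup could look like (the script name, paths and cron timing here are illustrative placeholders, not the exact attached scripts), the idea is simply to delete the previous master snapshot and take a fresh one:

#!/bin/bash
# es_backup.sh - keep a single "master" snapshot of the docs index.
# Intended to run from cron, e.g. nightly: 0 2 * * * /path/to/es_backup.sh

ES_HOST="localhost:9200"
REPO="docs_backup"
SNAPSHOT="snapshot"

# Snapshot names must be unique within a repo, so drop the old one first
curl --silent -XDELETE "$ES_HOST/_snapshot/$REPO/$SNAPSHOT"

# Take a fresh snapshot of the docs index and wait for it to finish
curl --silent -XPUT "$ES_HOST/_snapshot/$REPO/$SNAPSHOT?wait_for_completion=true" -d '
{
    "indices": "docs"
}'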

The other use case for this functionality is moving a large index between environments without needing to rebuild it, for example from a local site to staging. If it's something that's done often, the backup repository can be set up identically on stage, and then it's just a matter of compressing the backup directory, transferring the file, unpacking it and restoring.
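
A rough sketch of that flow, assuming a hypothetical staging host ("stage.example.com") with the same path.repo setting and a "docs_backup" repository already registered:

# On the local machine: archive the backup directory and copy it over
tar -czf esbackup.tar.gz -C /vagrant esbackup
scp esbackup.tar.gz stage.example.com:/tmp/

# On the staging server: unpack into the repository location
# (make sure the elasticsearch user can read the unpacked files), then restore
tar -xzf /tmp/esbackup.tar.gz -C /vagrant
curl -XPOST localhost:9200/_snapshot/docs_backup/snapshot/_restore -d '
{
    "indices": "docs"
}'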

Here are some further resources on the topic of snapshot and restore within Elasticsearch. Also, attached are some shell scripts that have been written to help out with backing up and restoring.

Here are the shell scripts built up around the Elasticsearch API. The backup one can (and should) be set up on a schedule.