Search is omnipresent these days, from the moment we type a set of keywords into our favourite search engine to find a webpage, to the moment we type a name and expect our email client to find all the emails sent by that person. Both processes are built on years of research and experimentation in the field of Information Retrieval, aimed at finding the most relevant documents efficiently.
This blogpost will show how to set up Elasticsearch, one of the best and most popular search engines (with Solr being the other main alternative). Its main strengths are remarkable scalability and advanced querying and indexing capabilities with minimal engineering effort. In addition to this, I will also show how to perform some basic operations using elastisch, a fantastic Clojure library for Elasticsearch.
Installation
In order to install our desired version of the system, we can go to the elastic website and download a zip file. I am using the latest stable version (2.3.1). After this, we can unzip the file in any location, run ./bin/elasticsearch, and we have an Elasticsearch environment up and running. You can check that everything is correct just by hitting localhost:9200 (the default port where the system runs): the response should be a short JSON document containing, among other fields, the node name, the cluster name and the Elasticsearch version.
This has created a system with one node (server), but no indices or shards have been created yet. So, our next step will be to create an index and start injecting documents into it. I am not going to explain many of the details of Elasticsearch itself (e.g., nodes, shards, …). Some good resources for looking into the topic are listed below:
- Shay Banon – ElasticSearch: Big Data, Search, and Analytics
- Elastic Search from the bottom up
- Quora: What are the best resources to master Elasticsearch
Adding data
Elasticsearch is designed to be used via a REST API, which makes its management quite simple. For instance, in order to create a new index, the curl request shown below will suffice. You can find more information about the capabilities related to indices in the documentation.
curl -XPUT 'http://localhost:9200/test/'
This will create a new index named test using the default configuration. We can then add and search some documents:
curl -XPUT 'http://localhost:9200/test/article/1' -d '{ "title" : "Article 1", "date" : "2015-12-13T14:11:19", "message" : "Testing the system" }'
The previous command will create our first document, of type article (types are similar to tables in traditional database systems and help to keep different kinds of documents separate), in the index test using the id 1, and the system will return the following information confirming the insert:
{"_index":"test",
"_type":"article",
"_id":"1",
"_version":1,
"_shards": {"total":2,
"successful":1,
"failed":0},
"created":true}
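As an aside, the URL in the PUT command above follows a fixed addressing scheme: every document lives at /{index}/{type}/{id}. A quick sketch in plain Python (no client library involved; the values are just the ones from this example) makes the scheme explicit:

```python
def doc_url(host, index, doc_type, doc_id):
    """Elasticsearch documents are addressed as /{index}/{type}/{id}."""
    return "{}/{}/{}/{}".format(host, index, doc_type, doc_id)

# The address of the article we just inserted
print(doc_url("http://localhost:9200", "test", "article", "1"))
# http://localhost:9200/test/article/1
```

A GET request against that same URL would retrieve the document back by id.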
Obviously, we wouldn’t like to add documents one by one. For this reason, Elasticsearch provides a bulk operation capability. When using curl, we can use a text file as the source of the bulk commands to be applied. In our case, we have created a requests.txt file as shown below:
{ "index" : { "_index" : "test", "_type" : "article", "_id" : "2" } }
{ "title" : "article2", "date" : "2015-04-13T14:11:19", "message" : "Test" }
{ "index" : { "_index" : "test", "_type" : "article", "_id" : "3" } }
{ "title" : "Third", "date" : "2015-11-13T14:12:19", "message" : "Other" }
curl -s -XPOST localhost:9200/_bulk --data-binary "@requests.txt"
With the simple command shown above, each of the bulk actions in the file will be applied. In this particular example, the response from the system, confirming that everything was correct, is the following:
{"took":8, "errors":false,
 "items":[{"index":{"_index":"test","_type":"article","_id":"2",
           "_version":1,
           "_shards":{"total":2,"successful":1,"failed":0},
           "created":true, "status":201}},
          {"index":{"_index":"test","_type":"article","_id":"3",
           "_version":1,
           "_shards":{"total":2,"successful":1,"failed":0},
           "created":true, "status":201}}]}
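If you ever generate the bulk file programmatically rather than by hand, the detail to get right is the payload shape: newline-delimited JSON in which every document contributes one action line followed by one source line, and the whole body ends with a newline. A minimal sketch in plain Python (no Elasticsearch client; it just rebuilds the requests.txt content from above):

```python
import json

# The same documents we indexed above
docs = [
    {"_id": "2", "title": "article2", "date": "2015-04-13T14:11:19", "message": "Test"},
    {"_id": "3", "title": "Third", "date": "2015-11-13T14:12:19", "message": "Other"},
]

lines = []
for doc in docs:
    source = {k: v for k, v in doc.items() if k != "_id"}
    # Action line: the operation (index) and where it goes (_index/_type/_id)...
    lines.append(json.dumps({"index": {"_index": "test", "_type": "article",
                                       "_id": doc["_id"]}}))
    # ...followed by the document source itself
    lines.append(json.dumps(source))

# The bulk endpoint requires the body to end with a newline
body = "\n".join(lines) + "\n"

with open("requests.txt", "w") as f:
    f.write(body)

print(len(body.splitlines()))  # 4: two lines per document
```

The resulting file can then be sent exactly as in the curl command above.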
We now have three documents in our index, and we might want to search them using keywords. For instance, we may want to retrieve those mentioning the word “third”:
curl -XGET 'http://localhost:9200/test/article/_search?q=third'
As expected, this query only returns the last document we inserted into Elasticsearch:
{"took":35,"timed_out":false,
"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":1,"max_score":0.375,
"hits":[{"_index":"test","_type":"article","_id":"3",
"_score":0.375,
"_source":{ "title" : "Third",
"date" : "2015-11-13T14:12:19",
"message" : "Other"}}]}}
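One small practical note on the q parameter: it travels in the URL, so queries containing spaces or other special characters must be URL-encoded before being sent. A quick sketch in Python (standard library only, reusing the endpoint from the curl example above):

```python
from urllib.parse import urlencode

# urlencode takes care of escaping spaces and special characters;
# for a single plain word it leaves the value untouched.
params = urlencode({"q": "third"})
url = "http://localhost:9200/test/article/_search?" + params
print(url)  # http://localhost:9200/test/article/_search?q=third
```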
I understand that indexing three documents does not sound that exciting, but the foundations and principles are exactly the same and elasticsearch could easily scale up to millions of articles. Furthermore, these examples using curl are good to explain the main concepts, but this is probably not the way you will use Elasticsearch in a real environment. The next section will show how to recreate the same steps I have just shown with curl, using Clojure.
Programmatic access with Clojure
This part of the blogpost assumes that we have downloaded the Elasticsearch files and started a completely empty cluster locally, as explained in the previous sections. For those of you who still have an index in the system, you can delete our test index with the following command in order to start with a clean environment (note: be very careful when deleting indices):
curl -XDELETE 'http://localhost:9200/test/'
The library I am going to use is elastisch, which provides a very clear Clojure interface to most of the Elasticsearch endpoints. In this case, we are just going to recreate the same steps we performed with curl, but I intend to write more advanced blogposts on this matter soon.
The first thing we have to do is to create a new Clojure project and include the dependency for elastisch in our project.clj file:
[clojurewerkz/elastisch "2.2.1"]
Once this is done, we can establish a connection with our ES server (which has to be running when we execute this code) using the connect function from the rest namespace in elastisch. After this, we can use this connection (in this case pointing to our local ES) to send requests to our system. The following code will create our test index:
(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]))

(def index-name "test")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/create conn index-name))
Once the index is created, we can use the document namespace to start adding documents to our engine so we can retrieve them later. We can expand the previous code to add documents as follows:
(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]
            [clojurewerkz.elastisch.rest.document :as doc]))

(def index-name "test")
(def type "articles")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/delete conn index-name)
  (idx/create conn index-name)
  (doc/create conn index-name type {:title "Article 1" :date "2015-12-13T14:11:19" :message "Testing"})
  (doc/create conn index-name type {:title "article2" :date "2015-04-13T14:11:19" :message "Test"})
  (doc/create conn index-name type {:title "Third" :date "2015-11-13T14:12:19" :message "Other"}))
This code also has one additional line that deletes the index before recreating it. This makes sure that the create call succeeds and that we always start from a newly created index. Finally, we would like to be able to search across all the fields in our index given a keyword:
(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]
            [clojurewerkz.elastisch.rest.document :as doc]
            [clojurewerkz.elastisch.query :as q]))

(def index-name "test")
(def type "articles")
(def query "third")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/delete conn index-name)
  (idx/create conn index-name)
  (doc/create conn index-name type {:title "Article 1" :date "2015-12-13T14:11:19" :message "Testing"})
  (doc/create conn index-name type {:title "article2" :date "2015-04-13T14:11:19" :message "Test"})
  (doc/create conn index-name type {:title "Third" :date "2015-11-13T14:12:19" :message "Other"})
  ;; Pause for 2s: by default Elasticsearch refreshes the index once per
  ;; second, so freshly indexed documents are not searchable immediately
  (Thread/sleep 2000)
  (println (doc/search conn index-name type :query (q/match :_all query))))
If this code is run without the line asking the system to wait for 2 seconds, we won’t be able to retrieve any documents, as a newly indexed document only becomes searchable after the next index refresh, which happens every second by default. The refresh interval can easily be modified; tuning it can bring significant efficiency gains during heavy indexing, as this blogpost explains.
This code will retrieve the specific document with a response very similar to the one shown below. This includes information from Elasticsearch itself (e.g., the time it took to retrieve the results, the number of shards, the maximum relevance score, …) and the documents that matched our query.
{:took 11, :timed_out false,
 :_shards {:total 5, :successful 5, :failed 0},
 :hits {:total 1, :max_score 0.11506981,
        :hits [{:_index test, :_type articles, :_id AVRTm4mKKoxNxvdBzdD_,
                :_score 0.11506981,
                :_source {:title Third,
                          :date 2015-11-13T14:12:19,
                          :message Other}}]}}
For this example in particular, I only care about the documents’ information, and therefore I will access the :_source field of each element of the inner :hits vector, which itself lives under the outer :hits key. This can easily be done by changing our last line in the following way:
(println ((comp (partial map :_source) :hits :hits) (doc/search conn index-name type :query (q/match :_all query))))
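If the comp/partial combination looks opaque, it simply chains three steps: take :hits from the response map, take the inner :hits vector, and map :_source over its elements. The same extraction, sketched in Python over a response-shaped dictionary (all values illustrative):

```python
# A response shaped like the one printed above; the id and score are
# placeholder values for illustration only.
response = {
    "took": 11,
    "hits": {
        "total": 1,
        "max_score": 0.115,
        "hits": [
            {"_index": "test", "_type": "articles", "_id": "some-generated-id",
             "_score": 0.115,
             "_source": {"title": "Third", "date": "2015-11-13T14:12:19",
                         "message": "Other"}},
        ],
    },
}

# outer "hits" map -> inner "hits" list -> each hit's "_source"
sources = [hit["_source"] for hit in response["hits"]["hits"]]
print(sources)
```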
The Clojure code shown in this article is available as a GitHub Gist. I hope you have found this blogpost interesting. I know the content is quite simple, but my intention was to show the potential of Elasticsearch and elastisch and to give some references for people starting out with both.