
Elasticsearch and Clojure: Getting Started

Search is omnipresent these days, from the moment we type a set of keywords into our favourite search engine to find a webpage, to the moment we type a name and expect our email client to find all the emails sent by that person. Both these processes build on years of research and experimentation in the field of Information Retrieval, aimed at efficiently finding the most relevant documents.

This blogpost will show how to set up Elasticsearch, one of the best and most popular search engines (with Solr being the other main alternative). Its main strength is that it offers remarkable scalability and advanced querying and indexing capabilities with minimal engineering effort. In addition, I will also show how to perform some basic operations using elastisch, a fantastic Clojure client for Elasticsearch.

Installation
In order to install our desired version of the system, we can go to the elastic website and download a zip file. I am using the latest stable version (2.3.1). After this, we can unzip the file in any location, run ./bin/elasticsearch, and we have an Elasticsearch environment up and running. You can check that everything is correct just by hitting localhost:9200 (the default port where the system is running); the output should be similar to the following:

[Screenshot: JSON response from http://localhost:9200 showing the node name, cluster name, and Elasticsearch version (2.3.1)]

This has created a system with one node (server), but no indices or shards have been created yet. So, our next step is to create an index and start injecting documents into it. I am not going to explain many of the details of Elasticsearch itself (e.g., nodes, shards, …). Some good resources for looking into the topic are listed below:

Adding data
Elasticsearch is designed to work via a REST API, which makes its management quite simple. For instance, in order to create a new index, the curl request shown below will suffice. You can find more information about the capabilities related to indices in the documentation.

curl -XPUT 'http://localhost:9200/test/'

This will create a new index named test using the default configuration. We can then add and search some documents:

curl -XPUT 'http://localhost:9200/test/article/1' -d 
'{ "title" : "Article 1", 
   "date" : "2015-12-13T14:11:19", 
   "message" : "Testing the system" }'

The previous command will create our first document, of type article (types are similar to tables in traditional database systems and help to separate different kinds of data), in the index test using the id 1, and the system will return the following information confirming the insert:
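To make the URL layout concrete, here is a small Python sketch (a hypothetical helper, not part of any library used in this post) that builds the same request path and body the curl call above uses; a document lives at /{index}/{type}/{id}:

```python
import json

def index_doc_request(index, doc_type, doc_id, doc):
    """Build the (url, body) pair for indexing a single document.

    Mirrors the curl call above: documents live at /{index}/{type}/{id}.
    """
    url = "http://localhost:9200/%s/%s/%s" % (index, doc_type, doc_id)
    return url, json.dumps(doc)

url, body = index_doc_request("test", "article", 1,
                              {"title": "Article 1",
                               "date": "2015-12-13T14:11:19",
                               "message": "Testing the system"})
print(url)  # → http://localhost:9200/test/article/1
```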

{"_index":"test",
 "_type":"article",
 "_id":"1",
 "_version":1,
 "_shards": {"total":2,
             "successful":1,
             "failed":0},
 "created":true}

Obviously, we wouldn’t want to add documents one by one. For this reason, Elasticsearch provides a bulk operation. When using curl, we can use a text file as the source of the bulk commands to be applied. In our case, we have created a requests.txt file as shown below:

{ "index" : { "_index" : "test", "_type" : "article", "_id" : "2" } }
{ "title" : "article2", "date" : "2015-04-13T14:11:19", "message": "Test"}
{ "index" : { "_index" : "test", "_type" : "article", "_id" : "3" } }
{ "title" : "Third", "date" : "2015-11-13T14:12:19", "message" : "Other"}

curl -s -XPOST localhost:9200/_bulk --data-binary "@requests.txt"
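The bulk body is newline-delimited JSON: one action line followed by one source line per document, and the _bulk endpoint requires the body to end with a newline character. A small Python sketch (an illustrative helper, not something the post's tooling provides) that assembles such a payload:

```python
import json

def bulk_payload(index, doc_type, docs):
    """Assemble a _bulk body: an action line plus a source line per document.

    `docs` is a sequence of (id, document) pairs. Note the trailing newline,
    which the _bulk endpoint requires.
    """
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index,
                                           "_type": doc_type,
                                           "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

payload = bulk_payload("test", "article",
                       [("2", {"title": "article2", "message": "Test"}),
                        ("3", {"title": "Third", "message": "Other"})])
```

The resulting string is exactly what requests.txt contains above, ready to be sent with --data-binary.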

After running the simple command shown above, each one of the bulk operations is applied. In this particular example, the response from the system, confirming that everything was correct, is the following:

{"took":8,
 "errors":false,
 "items":[{"index":{"_index":"test","_type":"article","_id":"2",
                    "_version":1,
                    "_shards":{"total":2,"successful":1,"failed":0},
                    "created":true,
                    "status":201}},
          {"index":{"_index":"test","_type":"article","_id":"3",
                    "_version":1,
                    "_shards":{"total":2,"successful":1,"failed":0},
                    "created":true,
                    "status":201}}]}

We now have three documents in our index and we might want to search them using keywords. For instance, we want to retrieve those mentioning the word “third”:

curl -XGET 'http://localhost:9200/test/article/_search?q=third'

As expected, this query only returns the last document we inserted into Elasticsearch:

{"took":35,"timed_out":false,
 "_shards":{"total":5,"successful":5,"failed":0},
 "hits":{"total":1,"max_score":0.375,
         "hits":[{"_index":"test","_type":"article","_id":"3",
         "_score":0.375,
         "_source":{ "title" : "Third", 
                     "date" : "2015-11-13T14:12:19", 
                     "message" : "Other"}}]}}
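In practice, the interesting part of such a response is the list of matching documents under hits.hits, each carrying its original JSON in _source. A small Python sketch (a hypothetical helper operating on the parsed response) for pulling them out:

```python
def hit_sources(response):
    """Extract the original documents (_source) from a parsed search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]

# Illustrative response shaped like the one shown above
sample = {"took": 35, "timed_out": False,
          "hits": {"total": 1, "max_score": 0.375,
                   "hits": [{"_index": "test", "_type": "article", "_id": "3",
                             "_score": 0.375,
                             "_source": {"title": "Third",
                                         "date": "2015-11-13T14:12:19",
                                         "message": "Other"}}]}}

print(hit_sources(sample))
# → [{'title': 'Third', 'date': '2015-11-13T14:12:19', 'message': 'Other'}]
```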

I understand that indexing three documents does not sound that exciting, but the foundations and principles are exactly the same, and Elasticsearch can easily scale up to millions of articles. Furthermore, these examples using curl are good for explaining the main concepts, but this is probably not the way you will use Elasticsearch in a real environment. The next section will show how to recreate the same steps I have just shown with curl, using Clojure.

Programmatic access with Clojure
This part of the blogpost assumes that we have downloaded the Elasticsearch files and started a completely empty cluster locally, as explained in the previous sections. For those of you who still have an index in the system, you can delete our test index with the following command in order to start with a clean environment (note: be very careful when deleting indices):

curl -XDELETE 'http://localhost:9200/test/'

The library I am going to use is elastisch, which provides a very clean Clojure interface to most of the Elasticsearch endpoints. In this case, we are just going to recreate the same steps we performed with curl, but I intend to write more advanced blogposts on this topic soon.

The first thing we have to do is to create a new Clojure project and include the dependency for elastisch in our project.clj file:

 [clojurewerkz/elastisch "2.2.1"]

Once this is done, we can establish a connection with our ES server (which has to be running when we execute this code) using the connect function from the rest namespace in elastisch. After this, we can use this connection (in this case pointing to our local ES) to send requests to our system. The following code will create our test index:

(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]))

(def index-name "test")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/create conn index-name))

Once the index is created, we can use the document namespace to start adding documents to our engine in order to retrieve them later. We can expand the previous code to add documents as follows:

(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]
            [clojurewerkz.elastisch.rest.document :as doc]))

(def index-name "test")
(def type "articles")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/delete conn index-name)
  (idx/create conn index-name)

  (doc/create conn index-name type {:title "Article 1"
                                    :date "2015-12-13T14:11:19"
                                    :message "Testing"})
  (doc/create conn index-name type {:title "article2"
                                    :date "2015-04-13T14:11:19"
                                    :message "Test"})
  (doc/create conn index-name type {:title "Third"
                                    :date "2015-11-13T14:12:19"
                                    :message "Other"}))

This code also has one additional line that deletes the index, which ensures that the create call works against a newly created index. Finally, we would like to be able to search all the fields in our index given a keyword:

(ns clojure-elasticsearch.core
  (:require [clojurewerkz.elastisch.rest :as esr]
            [clojurewerkz.elastisch.rest.index :as idx]
            [clojurewerkz.elastisch.rest.document :as doc]
            [clojurewerkz.elastisch.query :as q]))
(def index-name "test")
(def type "articles")
(def query "third")

(let [conn (esr/connect "http://127.0.0.1:9200")]
  (idx/delete conn index-name)
  (idx/create conn index-name)

  (doc/create conn index-name type {:title "Article 1"
                                    :date "2015-12-13T14:11:19"
                                    :message "Testing"})
  (doc/create conn index-name type {:title "article2"
                                    :date "2015-04-13T14:11:19"
                                    :message "Test"})
  (doc/create conn index-name type {:title "Third"
                                    :date "2015-11-13T14:12:19"
                                    :message "Other"})

  ;; Force the program to stop for 2s because ES takes 1s (default)
  ;; to allow documents to be searchable
  (Thread/sleep 2000)

  (println (doc/search conn index-name type :query (q/match :_all query))))

If this code is run without the line asking the system to wait for 2 seconds, we won’t be able to retrieve any documents, as the time required for a document to become searchable is 1s by default (the index refresh interval). The refresh interval can easily be modified; furthermore, tuning it can bring significant efficiency gains, as this blogpost explains.
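As an illustration, the refresh interval is an index setting that can be updated through the _settings endpoint. The sketch below builds the JSON body you would PUT to http://localhost:9200/test/_settings; the value of 30s is purely illustrative:

```python
import json

# Body for: curl -XPUT 'http://localhost:9200/test/_settings' -d '...'
# "refresh_interval" is the index setting controlling how often newly indexed
# documents become searchable; "30s" is just an example value.
settings_body = json.dumps({"index": {"refresh_interval": "30s"}})
```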

This code will retrieve the specific document, with a response very similar to the one shown below. This includes information from Elasticsearch itself (e.g., the time it took to retrieve the results, the number of shards, the maximum relevance score, …) and the documents that matched our query.

{:took 11, 
 :timed_out false, 
 :_shards {:total 5, :successful 5, :failed 0}, 
 :hits {:total 1, 
        :max_score 0.11506981, 
        :hits [{:_index test, 
                :_type articles, 
                :_id AVRTm4mKKoxNxvdBzdD_, 
                :_score 0.11506981, 
                :_source {:title Third, 
                          :date 2015-11-13T14:12:19, 
                          :message Other}}]}}

For this example in particular, I only care about the documents’ information, and therefore I will access the :_source field of each element inside :hits, which itself sits inside the top-level :hits. This can easily be done by changing our last line in the following way:

(println ((comp (partial map :_source) :hits :hits)
           (doc/search conn index-name type :query (q/match :_all query))))

The Clojure code shown in this article is available as a GitHub Gist. I hope you have found this blogpost interesting. I know the content is quite simple, but my intention was to show the potential of Elasticsearch and elastisch, and to give some references for people starting out with both.

 
