I have been curious about the IBM Watson ecosystem and its set of APIs for quite a long time now, and I have finally found some time to start playing with some of its modules. The ecosystem has numerous APIs that expose functions to solve different problems, such as personality detection or machine translation, to name just two. In my particular case, I was more interested in the APIs provided by one of their recently acquired companies, AlchemyAPI, which offers Natural Language Processing (NLP) operations. After looking at all the possible options, I decided to investigate the following set of calls to get an idea of their accuracy and flexibility:
- Entity Extraction
- Sentiment Analysis
- Concept Tagging
- Relation Extraction
- Taxonomy Classification
Each one of these tasks has been addressed using plain text passed as a parameter of the call. Although I would assume this is the most common requirement, the APIs support other input formats (e.g., a URL). The usage of the different APIs is similar, and it is described in detail in the documentation. I won’t go into the details of how to run all the APIs because all the code required is in my GitHub repository. The following code (also available as a gist) shows how to run the entity extraction using Clojure:
```clojure
(ns ibmwatson.text
  (:require [clj-http.client :as client]))

(defn- text-call
  "Issues a GET request against the given AlchemyAPI text endpoint."
  [text endpoint params]
  (let [root_url (str "http://access.alchemyapi.com/calls/text/" endpoint)]
    (client/get root_url {:query-params params})))

(defn entities
  "Runs ranked named-entity extraction (with disambiguation) over text."
  [text api_key]
  (let [params {"apikey" api_key
                "text" text
                "outputMode" "json"
                "disambiguated" 1}]
    (text-call text "TextGetRankedNamedEntities" params)))
```
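Calling the function from the REPL returns a standard clj-http response map, with the JSON payload in the `:body`. A quick sketch (the API key below is, of course, a placeholder):

```clojure
;; REPL usage sketch -- "YOUR_API_KEY" stands in for a real AlchemyAPI key.
(def resp
  (entities "Donald Trump announced that he is going to run ..." "YOUR_API_KEY"))

(:status resp) ;; => 200 on success
(:body resp)   ;; => the raw JSON string with the ranked named entities
```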
The clj-http library provides very useful and simple functions to interact with RESTful APIs. The only requirement is to add it to the dependency list in our project.clj, as shown below:
```clojure
(defproject ibmwatson "SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-http "1.1.2"]])
```
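Note that clj-http hands back the JSON payload as a plain string; to work with it as Clojure data we also need a JSON library. A minimal sketch using cheshire (which would have to be added to the `:dependencies` vector above):

```clojure
(require '[cheshire.core :as json])

;; Parse the JSON body into a Clojure map with keyword keys and pull
;; out the entities vector.
(-> (entities "some text ..." "YOUR_API_KEY")
    :body
    (json/parse-string true)
    :entities)
```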
The entities function is not, by any means, a final version: the parameters of the main call are hardcoded, and the entities will always be disambiguated. I might work more on this in the future. However, if I end up using AlchemyAPI more extensively, I will spend more time looking into existing solutions such as this SDK for AlchemyAPI in Clojure.
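As a rough idea of what a less hardcoded version could look like, the following sketch (my own variant; the `entities*` name and the `opts` argument are not part of the repository code) merges caller-supplied parameters over the defaults, so disambiguation can be switched off per call:

```clojure
(defn entities*
  "Like entities, but extra AlchemyAPI parameters can be supplied in opts
  and override the defaults."
  [text api_key & [opts]]
  (let [defaults {"apikey" api_key
                  "text" text
                  "outputMode" "json"
                  "disambiguated" 1}]
    (text-call text "TextGetRankedNamedEntities" (merge defaults opts))))

;; e.g. (entities* "some text ..." "YOUR_API_KEY" {"disambiguated" 0})
```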
Now, how good are the results we are getting? The following is part of the result of the entity extraction (with disambiguation) for the text: “Donald Trump announced that he is going to run for the Republican party’s presidential”:
```json
"entities": [
  {
    "type": "Person",
    "relevance": "0.33",
    "count": "2",
    "text": "Donald Trump",
    "disambiguated": {
      "subType": [ … "Celebrity", "CompanyFounder" … ],
      "name": "Donald Trump",
      "website": "http://www.trumponline.com/",
      "dbpedia": "http://dbpedia.org/resource/Donald_Trump",
      "freebase": "http://rdf.freebase.com/ns/m.0cqt90",
      "opencyc": "http://sw.opencyc.org/concept/Mx4rv0ncIZwpEbGdrcN5Y29ycA",
      "yago": "http://yago-knowledge.org/resource/Donald_Trump"
    }
  },
  {
    "type": "Organization",
    "relevance": "0.33",
    "count": "1",
    "text": "Republican party"
  }
]
```
This result shows quite good NER detection but a lack of recall in the disambiguation step: the Republican Party should probably have been linked as well. On the other hand, it shows the large number of knowledge bases that the API links to: websites, DBpedia, Freebase, YAGO, and more. This is a really good feature that can add a lot of extra value. An interesting output for the same test sentence is the one given by the concept tagging, where the ex-wife and father of Donald Trump are picked as the most related concepts:
```json
"concepts": [
  {
    "text": "Donald Trump",
    "relevance": "0.92115",
    "website": "http://www.trumponline.com/",
    "dbpedia": "http://dbpedia.org/resource/Donald_Trump",
    "freebase": "http://rdf.freebase.com/ns/m.0cqt90",
    "opencyc": "http://sw.opencyc.org/concept/Mx4rv0ncIZwpEbGdrcN5Y29ycA",
    "yago": "http://yago-knowledge.org/resource/Donald_Trump"
  },
  {
    "text": "Ivana Trump",
    "relevance": "0.771705",
    "website": "http://www.ivanatrump.com/",
    "dbpedia": "http://dbpedia.org/resource/Ivana_Trump",
    "freebase": "http://rdf.freebase.com/ns/m.0429hq",
    "opencyc": "http://sw.opencyc.org/concept/Mx4rLnwX_CzvQdiO0r355SCNzA",
    "yago": "http://yago-knowledge.org/resource/Ivana_Trump"
  },
  {
    "text": "Fred Trump",
    "relevance": "0.768877",
    "dbpedia": "http://dbpedia.org/resource/Fred_Trump",
    "freebase": "http://rdf.freebase.com/ns/m.04g__c",
    "yago": "http://yago-knowledge.org/resource/Fred_Trump"
  }
]
```
Documents with more context (i.e., longer text) provide even better outputs for the concept extraction.
After this quick interaction with IBM Watson, and more specifically the Alchemy APIs, I have to admit that my thoughts are quite divided with respect to the design and the quality of the output.
Although the API is slightly confusing at the beginning, it allows for complex combinations between different “components”. For instance, to get the sentiment analysis of a document at the entity level, we have to call the entity extraction API rather than the sentiment analysis one. The reason behind this becomes apparent after using it for a while: the sentiment is probably wrapped around the output of the entity extraction step, so it becomes simpler to specify it as a parameter of the latter (e.g., “sentiment = 1”).
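In Clojure, that combination boils down to adding one extra query parameter to the entity extraction call. A minimal sketch, assuming the “sentiment” parameter behaves as just described (the `entities-with-sentiment` name is mine, not part of the repository):

```clojure
(defn entities-with-sentiment
  "Entity extraction with per-entity sentiment enabled via the
  \"sentiment\" parameter."
  [text api_key]
  (let [params {"apikey" api_key
                "text" text
                "outputMode" "json"
                "disambiguated" 1
                "sentiment" 1}]
    (text-call text "TextGetRankedNamedEntities" params)))
```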
In terms of quality, some of the modules gave worse results than I was expecting, although they are still very competitive for a commercial setting. For instance, the sentiment component was quite confused by almost-neutral sentences and gave me negative scores for no apparent reason; it felt like the neutral category was basically ignored. On the bright side, it dealt very well with complex situations involving negation. A similar situation appears with the entity extraction, the relation extraction and, to some extent, the taxonomy classification components. The API that gave quite amazing results, however, was the concept tagging. This API pulled related concepts (mainly from Wikipedia) that I assume were obtained using a quite extensive and detailed knowledge graph. The results and the relevance scores for those concepts uncovered underlying concepts that were not mentioned at all in the document, with very good accuracy and no effort from the developer’s perspective.