Strata + Hadoop (Second day)

The second (and last) day of the conference started with presentations from two massive companies: Philip Radley showed how BT is relying on Hadoop to substantially increase value for its clients, and Rod Smith (from IBM) defended the position that digital innovation is nowadays driven by real-time insights. He claimed that real time is becoming a critical cornerstone and summarised the three types of data analysis process we have seen in recent years:

Traditional: Time spent moving data around rather than analysing it.
Big Data: Driven by contextual data, more time analysing than driving actionable insights.
Rapid insights: Just-in-time, quick approximations of solutions.

From a technical perspective, the presentation also illustrated the benefits of Spark and described notebook IDEs as “the next generation spreadsheets”.

One of the main events of the day happened when Ben Lorica unveiled the new design for the O’Reilly portal. I believe that the new layout is much more elegant and simple, while making far more information available. This opinion seems to be the norm within the community, based on the Twitter conversations and the chats I had during the coffee breaks.

The first technical talk of the day I attended was presented by Costinl (from Elastic). He reminded us that data is no longer static and that fixed schemas are not well suited to many problems. He talked about how Elasticsearch can provide faceted search, using GitHub (one of their users) as an example of bucketing and metrics. Another use case mentioned during the presentation was how The Guardian is using Kibana to track user interaction with their system.

The next presentation focused on how to use Elasticsearch and Spark to process billions of tweets. This talk was given by Anirudh Koul and Shashank Singh, from Microsoft. During their presentation, they provided a number of useful tips, especially related to Elasticsearch. Their advice ranged from the most obvious points to more advanced ones based on shard sizes:

Only index what you will be searching for
Reduce the document size by skipping unwanted fields
Use bulk operations as much as possible
Keep shard size below 50GB
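
To make the first three tips a bit more concrete, here is a minimal Python sketch of my own (not code from the talk). It trims each document down to a whitelist of fields and packs the result into the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint expects. The field names, the `tweets` index name and the sample documents are made up for illustration, and older Elasticsearch versions would also expect a `_type` in the action line.

```python
import json

# Hypothetical whitelist: only the fields we expect to search on.
SEARCHABLE_FIELDS = ("text", "lang", "created_at")


def slim(doc, keep=SEARCHABLE_FIELDS):
    """Return a copy of doc containing only the whitelisted fields."""
    return {key: doc[key] for key in keep if key in doc}


def bulk_body(docs, index="tweets"):
    """Build a newline-delimited JSON payload for the _bulk endpoint.

    Each document contributes two lines: an action line and a source line.
    The payload has to end with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(slim(doc)))
    return "\n".join(lines) + "\n"


# Made-up sample documents with fields we do not want to index.
tweets = [
    {"text": "hello strata", "lang": "en", "raw_html": "<p>...</p>"},
    {"text": "hola", "lang": "es", "created_at": "2015-05-07", "debug": True},
]
payload = bulk_body(tweets)
print(payload)
```

The resulting string would be sent in a single POST to `_bulk` instead of issuing one request per document, which is where the saving from bulk operations comes from.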

One of the tips I didn’t expect to hear was that choosing a predictable document identifier rather than a random one led to massive improvements in their system. According to their data, using a UUID (v4) as the document id gives, by far, the worst performance. This sounds like a very interesting insight and I would like to re-watch the presentation once it becomes available online to understand it better.

Independently of Elasticsearch, the talk also provided other tips and little nuggets of information. They mentioned that chi-square is much better than slope for detecting trends in time data, and that only 68% of the tweets can be mapped to the physical world (e.g., people, places, …).

The next talk presented Sparkta, a “Pure spark declarative aggregation workflow that is easy to use with high performance at scale”. The talk was given by Oscar Méndez and David Morales from Stratio. Sparkta provides off-the-shelf integration with several data stores such as Oracle, DB2, MongoDB or Cassandra. The most common usage would include storing data in HDFS, applying Spark to process and aggregate it, and storing the insights in a NoSQL database. During the presentation they mentioned other similar projects like Rainbird (Twitter), Countandra (Deprecated), ThunderRain (Intel) or Tsar (Twitter).

I apologise for all the talks I haven’t summarised here, and for the ones I couldn’t even attend. This is one of the “problems” of multitrack conferences. All in all, the conference was a great experience and a perfect opportunity to meet new people and get exposed to new tools and technologies.

The Practical Academic

Merging the best research in Text Analytics with practical and commercial perspectives
