I attended my first ever Strata + Hadoop conference a week ago in Barcelona. Strata is probably the biggest and most important applied data science conference in the commercial world. My first impression is that it is quite a big conference, with an annoyingly large number of parallel tracks (up to 8). This means that you constantly find yourself deciding which talk to go to next, and this can be a very complex decision because some of the talks (and tracks) are closely related. This blogpost summarises my impressions of the conference, the main “buzzwords”, and the most mentioned tools.
I attended the Spark camp tutorial given by the guys from Databricks on the first day of the conference. Spark was probably the most used word during the conference, and it is at the top of my list of “technologies to try” right now. In a nutshell, Spark is an in-memory cluster computing framework with one very interesting characteristic: it allows batch (offline) processing in a similar fashion to Hadoop (although with a much higher level of abstraction), but it is also capable of processing streaming data for online, real-time requirements. This relates to the Lambda architecture, another concept that was repeated multiple times during the conference. In addition to this capability, I received a lot of good feedback about it, and several people told me that they were either using it now or seriously considering a proof of concept for the technology. Going back to the tutorial itself, although I think the presenters were fantastic, I don’t think that a room with 300 people is the best environment for a semi-interactive tutorial. On the other hand, this tutorial “forced” me to write my first line of Scala, which was quite interesting.
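To give a flavour of the programming model the tutorial covered, here is my own minimal sketch (not the Databricks material) of the classic word count. Since I can’t assume a running cluster here, it mimics Spark’s RDD transformation chain (flatMap → map → reduceByKey) in plain Python; with a real SparkContext the same pipeline would be expressed as chained method calls on an RDD.

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, data):
    """flatMap: apply func to each element and flatten the results."""
    return [item for element in data for item in func(element)]

def reduce_by_key(func, pairs):
    """reduceByKey: combine all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

lines = ["spark makes batch processing easy",
         "spark also handles streaming"]

words = flat_map(lambda line: line.split(), lines)   # flatMap
pairs = [(word, 1) for word in words]                # map
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey

print(counts["spark"])  # -> 2
```

The interesting part is that in Spark the exact same three-step pipeline applies both to a static file (batch) and, via Spark Streaming, to micro-batches arriving in real time, which is what makes it a natural fit for the Lambda architecture mentioned above.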
After the tutorial day, the main conference started. The first two things that struck me were the amazing quality of the presenters, especially when compared to the research conferences I usually attend, and the level of technical detail in the talks. As I mentioned before, this was my first Strata conference, and it was much more architecture- and developer-focused (rather than “pure” data science oriented) than I anticipated. Nonetheless, this impression was most likely shaped by the tracks and talks I chose to attend.
One of the most attractive aspects of Strata was that most presentations included some type of demonstration, usually live with on-the-cloud processing. In addition, almost all of them used an IPython Notebook. The IPython Notebook is “a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document”. This produces a reproducible and easy-to-share environment, and it sounds like a very fun thing to play with. This could be even more true for people in academia…
One of the other very interesting projects presented at the conference was SAMOA (Scalable Advanced Massive Online Analysis), an early-stage open-source project born at Yahoo! to tackle the problem of efficiently performing machine learning tasks on streams of data. Definitely worth a look…
Another tool that I want to comment on is Elasticsearch. Elasticsearch is an amazingly scalable search server based on Lucene, and it would be enough to fill hundreds of blogposts by itself. However, the main reason why I mention it here is the possibility of using it in conjunction with Spark, providing stream processing with real-time search, plus data visualisation capabilities for data analysis via Kibana.
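As a rough illustration of the integration idea: a Spark job ultimately needs to get documents into Elasticsearch so that Kibana can visualise them. A real pipeline would use the elasticsearch-hadoop connector for this, but the minimal sketch below just shows the shape of Elasticsearch’s bulk-indexing payload (the index name, field names, and sample documents are all made up for illustration).

```python
import json

def to_es_bulk(docs, index="events", doc_type="event"):
    """Serialise documents into Elasticsearch's bulk-indexing format:
    an action line followed by the document source, one pair per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

docs = [
    {"timestamp": "2014-11-20T10:00:00", "user": "alice", "clicks": 3},
    {"timestamp": "2014-11-20T10:00:05", "user": "bob", "clicks": 1},
]

payload = to_es_bulk(docs)
print(payload)
```

Once documents like these are indexed, Kibana can point at the same index and build real-time dashboards over the `timestamp` field, which is the visualisation side of the Spark + Elasticsearch combination mentioned above.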
As I mentioned before, this post was intended to present some of the tools and ideas that I found interesting during the conference, some of which are listed below in no particular order:
- Docker
- Spark
- IPython Notebook
- Integration of Spark + Hadoop + Elasticsearch + Kibana
It is likely that I will publish more posts about some of them in the future as I get to play with them.