Strata+Hadoop World is one of the world’s main conferences on Big Data technologies, and I was lucky enough to attend it last week. Even though this was my second Strata, I couldn’t help but be amazed at the scale of the conference, with 7 parallel sessions on topics ranging from data science to the future of Hadoop.
One of the characteristics of Strata is the number (and format) of its keynotes. Every day, there were about 8 keynote speakers with 15 minutes each. This is quite a change from the usual 60-90 minutes per speaker at other conferences. I was quite impressed by the presentation by Cait O’Riordan from Shazam, an app that can identify music tracks by listening to them, who explained how they use their data to detect “bands to be popular”. Their solution compares the number of Shazam clicks for songs that have been played on the radio the same number of times. In addition, the time of each click relative to the beginning of the song was also very important, as most of the popular songs had a much larger number of shazams in the first 10 seconds than in the rest of the track. Other interesting insights were how to detect who is the main “driver” of a song featuring multiple famous singers, and the correlation (or the lack of it) between movie releases and track popularity.
Another fantastic presentation was given by Tim Harford from the Financial Times. He explained how marginal or incremental improvements can compound into a large benefit if applied properly. This was brilliantly exemplified by the UK cycling team’s use of “hot pants” during the Olympics to keep their muscles at the perfect temperature just before a race. The talk took a very interesting twist when he focused on the life and research breakthroughs of Mario Capecchi, one of the fathers of genetic engineering. As a side note, his life is one of the most incredible stories of survival and scientific revolution, and I encourage you all to at least read his Wikipedia page. Tim used him as an example of how milestones and breakthroughs in science (as opposed to marginal gains) are also needed, as they are the ones that truly push the boundaries of human knowledge and change our view of the world. In the specific case of Mario Capecchi, he went against a national funding body to use research funds on a project that he believed in, even though he was explicitly forbidden to use them for that goal because the whole academic community thought it was closer to sci-fi than to science. Nonetheless, he was stubborn (and brilliant) enough to pursue that goal and succeed, for which he was later awarded the Nobel Prize in Medicine. The main question the keynote posed to the audience was “why is all the risk on the entrepreneurial researcher while the benefits are shared with all of humankind?”.
There was also an important announcement during the keynotes, as Google unveiled the beta release of Cloud Bigtable. It is their newest key-value store, which can scale to petabytes of information with billions of rows. For the sake of space, I won’t go into detail about the other keynotes, but I would strongly recommend visiting the Strata page to watch them online.
Once the keynotes were finished, we all split up into different sessions focusing on specific areas (e.g., data science, business, beyond Hadoop, …). In addition to the talks, there were several stands where Big Data companies were presenting their latest products or services. One of them stood out from the rest: DataRobot, a company that is trying to commoditise the process of solving regression and classification problems with an automatic solution that compares multiple models, while providing detailed insight into the data (e.g., distributions, tag clouds, …) and the algorithms (e.g., learning curves, confidences, …). I found the system quite impressive and I will definitely play with it at some point. Incidentally, one of the people at the stand was a student of one of my thesis examiners.
I decided to attend most of the talks in the Data Science track. The first talk focused on feature engineering, which is still the most time-consuming step (85% according to the speaker) in most data science processes. It is also well known that it can be the step with the highest impact on the quality of the model. Another topic that was addressed was the deployment and evaluation of machine learning models in production. Alice Zheng (from Dato) compared offline evaluation and A/B testing, and explained the differences and problems with some evaluation metrics, the most important being their lack of correlation with business goals and ROI. On a slightly more technical note, she suggested that Welch’s t-test is a better statistical test than the traditional Student’s t-test for comparing system accuracy, as Welch’s test does not assume that the variances of the two distributions are the same.
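To make the difference concrete, here is a minimal sketch of Welch’s t-test using only the Python standard library. It is my own illustration rather than anything shown in the talk: the function name `welch_t_test` and the accuracy figures are made up, and the degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import math
from statistics import mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Unlike Student's t-test, the two sample variances are never pooled,
    so the samples are allowed to have different variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up per-fold accuracies for two hypothetical systems.
accuracy_a = [0.81, 0.79, 0.80, 0.82, 0.78]
accuracy_b = [0.85, 0.70, 0.90, 0.75, 0.88]
t_stat, df = welch_t_test(accuracy_a, accuracy_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and also returns the p-value.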
Another talk from Dato, this time from its founder Carlos Guestrin, focused on how to use “deep features” to improve the representation of instances in a classification process. Carlos started by giving a gentle introduction to neural networks and how they can be used to learn complex (non-linear) representations of data. For instance, higher-level features for face recognition could be nose, eye or mouth shapes. According to him, the main challenges of deep learning are that it needs a huge amount of data and that it is computationally expensive. On the other hand, deep learning is the state of the art for multiple machine learning tasks. Carlos’ proposal is to apply transfer learning: use deep learning features as the input to a linear classifier that predicts the output. Imagine that we train a model that can differentiate between pictures of cats and dogs. After this, we can take that neural network and remove the last layer (the one producing the final output). As a result, the neural network will output a vector of activations that represents multiple high-level features. Now the magic part starts: we can use that neural network to represent labelled instances from a different task (e.g., images that are human faces or not) and then use a linear classifier to learn from them. The main challenge for this approach to be more widely applied is the ability to share such pre-trained neural networks in a centralised repository to encourage the dissemination of results. During his talk, he also mentioned the word2vec library, which uses neural networks to create distributed representations of text. Both the idea of deep features in general and the word2vec library in particular are things that I will explore much more deeply in the future.
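The recipe can be sketched in a few lines of plain Python. This is a toy illustration of the idea under heavy assumptions, not Dato’s implementation: the “pre-trained” hidden layer is just two hard-coded ReLU units standing in for a network trained on an earlier task, all names are hypothetical, and the new task is a tiny XOR-style problem that a linear classifier cannot solve on the raw inputs.

```python
import math

# Hypothetical frozen hidden layer: hard-coded stand-ins for weights that
# would normally come from a network trained on another task.
HIDDEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]
HIDDEN_BIASES = [0.0, 0.0]


def deep_features(x):
    """Run only the frozen hidden layer (ReLU); the output layer is dropped."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_classifier(features, labels, lr=0.5, epochs=500):
    """Logistic regression fitted with full-batch gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for h, y in zip(features, labels):
            score = sum(w * hi for w, hi in zip(weights, h)) + bias
            error = sigmoid(score) - y
            grad_w = [g + error * hi for g, hi in zip(grad_w, h)]
            grad_b += error
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Classify a raw input by scoring its deep features linearly."""
    h = deep_features(x)
    return int(sum(w * hi for w, hi in zip(weights, h)) + bias > 0)


# New task (XOR-like): not linearly separable in the raw inputs.
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]
features = [deep_features(x) for x in inputs]
weights, bias = train_linear_classifier(features, labels)
predictions = [predict(weights, bias, x) for x in inputs]
print(predictions)
```

Only the linear classifier is trained here; the hidden layer is never updated, which is what makes the approach cheap compared with training a deep network from scratch.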
There are three other talks that I would like to mention. Firstly, Felipe Hoffa demonstrated the power of Google BigQuery by running complex SQL queries over a terabyte of data in around 7 seconds, and by joining terabyte-sized tables in under a minute. The second talk showed how Barclays is creating outstanding infographics and local insights to provide actionable information for different businesses. Last but not least, Lars Trieloff gave a fantastic presentation about the communication breakdown between the data analysis and the decision-making steps. He even went to the extent of suggesting leaving all decisions to computers because “99.9% of decisions can be automated”. He illustrated his point by showing that a human-“enhanced” algorithm, a recommendation system where the final decision is made by a human, was performing worse than a fully automated system. Although I don’t agree completely with Lars’ perspective, this seems like a reasonable approach for the specific case of inventory replenishment that he was describing. Obviously, some type of alert system should be put in place to flag unusual orders, which may indicate that something unexpected is happening.
I thoroughly enjoyed the first day of the conference, and I managed to have really interesting discussions with amazing people. Also (as usual after a conference), I have a list of technologies, services and tools to play with, including IPython notebooks, Spark, DataRobot, Dato and deep features. Furthermore, the conference reinforced the idea that Python and R are still the most common languages within the data science community. However, I discovered that a lot of people knew about Clojure, which was a surprise.