February PyData Meet-up

Once again I attended the London PyData meet-up, and I am as happy as I was the first time. The day started with some news from the organisers, who listed some interesting discoveries within the Python ecosystem:

  1. A very interesting page that shows the time complexity of Python operations.
  2. SymPy, a package for symbolic mathematics, was named “package of the month”.
  3. They showed the distribution of meet-up attendance based on the day of the week. According to the data, Tuesday is the optimal day for PyData meet-ups, and no one cares about the group on Saturdays. Good to see data-driven decisions at a data conference.
  4. Two podcasts (that I didn’t know before) focusing on Python: Podcast.__init__ and Talk Python To Me.

After the announcements, the main event started with a very hardware-focused talk about high-performance computing with Xeon Phi cards. The speed improvements claimed by the presenter were quite impressive. However, some of the specific details were too low-level for me, because I have not explored the hardware side of high-performance computing in data science in a very long time. That being said, it is good to see the different perspectives of data science represented in the community. For those of you interested, the slides can be found here.

The second speaker was @PeterOwlett from @Deliveroo. First of all, I have to say that Deliveroo is awesome and has been feeding late development nights at Signal for quite a while now, so thank you guys!!

Coming back to the talk, the main problem Peter was trying to solve was predicting the number of drivers required at any given time, so that Deliveroo has enough drivers to deliver the food in a reasonable amount of time, but not so many that they lose money. A really good and interesting problem with lots of complexities (e.g., food being late, flat tires, geographical differences, …).

The first step in the analysis is to divide London into different areas, because the geographical dimension is critical for this type of problem. Once this is done, he treated the quantity of drivers required over time as a time series problem in order to forecast future needs, analysing each area individually. The idea is that a time series can be separated into an underlying trend, seasonal information, and noise, in a way that makes predictions more accurate. The main library used for this was statsmodels, which allowed him to do some basic time series analysis in a few lines of code. He also recommended that the audience check out the (free) book Forecasting: Principles and Practice.

One important factor is that they know they have a heavily seasonal pattern every 7 days, as take-away ordering behaviour depends strongly on the day of the week. This makes some of the analysis easier compared to problems where you need to extract this information from the data. The basic approach provided reasonable results, which improved when the effects of holiday periods and weather (using categorical information from the Met Office APIs) were included. We were also told how helpful Luigi was, and how the hourly driver demand was transformed into a data structure representing “shifts” for different drivers.
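To make the trend/seasonal/noise separation concrete, here is a minimal sketch using only pandas and NumPy on synthetic daily order counts with a 7-day pattern. This is my own illustration, not Peter's code or Deliveroo's data; statsmodels' `seasonal_decompose` performs the same additive split in a single call.

```python
import numpy as np
import pandas as pd

# Synthetic daily order counts with a weekly pattern: Fri-Sun are busier.
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", periods=8 * 7, freq="D")
base = np.where(days.dayofweek >= 4, 80.0, 30.0)
orders = pd.Series(base + rng.normal(0, 3, len(days)), index=days)

# Trend: a centred 7-day moving average smooths out the weekly cycle.
trend = orders.rolling(window=7, center=True).mean()

# Seasonal: the average detrended value for each day of the week.
detrended = orders - trend
seasonal = detrended.groupby(detrended.index.dayofweek).mean()

# Residual (noise): what remains once trend and seasonality are removed.
seasonal_full = pd.Series(seasonal.loc[detrended.index.dayofweek].values,
                          index=detrended.index)
residual = detrended - seasonal_full

print(seasonal.round(1))  # the repeating weekly pattern, by day of week
```

Forecasting then amounts to extrapolating the smooth trend and adding back the weekly seasonal component, which is exactly why a known 7-day cycle makes the problem easier.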
I would love to see this presentation in more detail as a blog post or a white paper, as I believe it is a fantastic example of real data science. In the meantime, we can see the slides here. One final point: although their results are amazing, there are still numerous factors they could exploit, such as internal marketing campaigns or congestion/roadworks information. I look forward to hearing more about this.

The next speaker was @Kimknilsson from Pivigo (one of the sponsors of the meet-up). She started by comparing the past (around 150 years ago) and the present with respect to the technology and devices we use and the offices we work in. Her conclusion was that the tools have changed beyond recognition (i.e., typewriters vs mobile phones), while the office space is virtually the same as it was a century and a half ago. She claimed that although there are some challenges, such as communication problems and efficiency, distributed work is actually doable and could be very beneficial for all parties.

To support this claim, she explained how they have applied this idea with a large group of distributed data scientists across multiple projects. Her findings were that processes like morning scrum stand-up sessions (via videoconference), regular Q&A sessions, and (surprisingly) “virtual social events”, where people chat about non-work-related stuff while having a drink in front of the computer, actually minimised the problems traditionally associated with remote working. My personal opinion is still that innovation happens in front of a whiteboard when you discuss interesting problems with intelligent people. Nonetheless, this talk also supports my view that remote working is beneficial for companies in the technology space and, if done right, can improve productivity while increasing people’s happiness. She recommended reading the blog post Increase growth and revenue by becoming distributed, which explains how Pivigo transformed some of their events from physical to virtual. Kim also used some of her time to promote Science 2 Data Science, a very promising programme that helps researchers in different fields become data scientists through intensive courses and industry experience.

The final talk of the night was one of the funniest and most unexpected uses of data science I have seen: how to compute an optimal portfolio of Pokemon. Vincent explained how to compute the “average number of turns” that a Pokemon could survive in a battle against any other Pokemon, and plotted this in a heat map over all possible pairs of Pokemon. This already gives us a feeling for which Pokemon are basically useless (sorry, Diglett…), as they will be defeated when fighting basically any other Pokemon. However, apart from these “outliers”, it was quite unclear which were the best contenders. The data visualisation was good for providing a general understanding of the space, but it cannot answer our initial question: which Pokemon should I choose? One alternative would be to pick the one with the best average. However, as Vincent explains in his blog post, this would ignore the fact that a Pokemon might be very good on average but terrible when fighting a specific enemy. For this reason, he reframed the problem and computed a single score for each Pokemon based on optimal stock portfolio theory, where not only the profitability of a stock is considered, but also its volatility.
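The portfolio-theory idea can be sketched with a Sharpe-style ratio: reward a high average and penalise volatility. The Pokemon names and turn counts below are invented for illustration, and this is my own simplification of the approach, not Vincent's actual scoring code.

```python
import numpy as np

# Each row: one Pokemon's "turns survived" against five hypothetical opponents.
turns = {
    "Diglett": np.array([1.0, 1, 2, 1, 1]),   # low mean: weak overall
    "Chansey": np.array([9.0, 2, 10, 1, 9]),  # high mean but very volatile
    "Snorlax": np.array([6.0, 7, 6, 7, 6]),   # solid and consistent
}

def portfolio_score(x, eps=1e-9):
    """Sharpe-style score: mean survival divided by its volatility."""
    return x.mean() / (x.std() + eps)

scores = {name: portfolio_score(x) for name, x in turns.items()}
best = max(scores, key=scores.get)
print(best)  # prints "Snorlax": consistency beats a slightly higher peak
```

Note how picking by raw average alone would make Chansey and Snorlax nearly tied (means of 6.2 vs 6.4), while the volatility penalty clearly separates them, which is exactly the point of borrowing from portfolio theory.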

I still believe that PyData is the best meet-up for general data science and a fantastic place to hear amazing stories and accomplishments with real, applied data problems. Also, I had a great time at the pub afterwards with lots of crazy discussions. See you at the next one!!


