It has been ages since I wrote my last post, and I think my first PyData meet-up is the perfect event to catch up with it. PyData (London) is the biggest group for both Python and Data Science in London, with more than 2,300 registered members.
The event started with some community announcements about a workshop on Python for high-performance computing (PyHPC) and the NIPS conference, which was introduced as “the most important machine learning conference in the world”. I personally think that this is not completely correct. NIPS is indeed one of the most important conferences in the field, but as far as I know it is more focused on Neural Networks than on Machine Learning in general. I would argue that KDD or ICML are actually the most important conferences in the field. Another interesting piece of information was the list of most requested topics for future talks: Machine Learning, Data Analysis, Medium Data and Deep Networks. It is great to see Medium Data in that list. We all hear things about Big Data, but a lot of people are actually working on medium-sized, very rich data rather than simpler but much larger datasets. The host of the event (AHL) also proposed a coding challenge that sparked the imagination of a lot of people in the room, only to take our hopes away by saying that the challenge is open to students only. There were multiple (half-joking) comments from people who wanted to enrol in a Master's programme just to be able to participate.
The first speaker of the night was Gabriel Straub, the Head of Data Science at Tesco. The content of his talk was remarkably similar to some of mine, as he talked about how to incorporate agile methodologies such as Scrum or Kanban into a research team. I am not going to explain what these techniques are because there are hundreds of great sources of information about them on the internet. Nonetheless, I want to focus on one point that Gabriel mentioned in his talk: agile methodologies are based on some common assumptions which hold in the majority of cases for development teams:
- The team has one common skill set, so every developer can take responsibility for any task if needed.
- Every story has a clearly defined goal and direction (i.e., user acceptance criteria).
However, neither of these applies directly to research teams, where there are multiple skill sets (e.g., network analysis vs search engine expertise) and, for some tasks, no clear direction or indication of feasibility. For instance, “Improving the quality of the recommendation part of the system” is a task that might be infeasible; however, that information would be available only after, or at least during, the processing of the ticket. Other stories are even more abstract, especially those related to data or exploratory analysis, where we are actually looking for questions to ask rather than answers. Gabriel went on to explain two concepts that are also a pivotal part of my research team:
- Test quick and kill early: Quick proofs of concept are the best way to find out whether a specific line of research is worth pursuing in more depth.
- Quality differences that the user cannot observe are not worth significant effort, as we must focus on the value for the customer. This principle has opened multiple debates in the past, and a lot of people (especially those coming from a strong academic background) will argue that by not investing in what seems a small improvement at the time, we are limiting the chances of a big breakthrough in the future. While I do not disagree with this argument, we must make the best use of the available resources, and we are never short of a new challenge to solve.
Some of the things that are working well for them are A/B testing, constant agility (even with respect to the agile process itself) and being very close to the customer. On the other hand, measuring progress and making the skill set more uniform across the team are parts of the process that still require improvement. Gabriel ended his presentation with the following takeaway points:
- Make hypotheses explicit: For instance “Test SVM with textual features to improve F1 given collection X” is a better ticket definition than “Improve classification quality”. This allows a much better definition of the research tasks.
- Time-box development. The idea behind this principle is to allocate a specific amount of time during which the team will try to improve a specific part of the system. I understand the logic behind it, but I have to admit that I am not a fan given its rigidity.
- Pair programming is key: Never allow people to work alone on a project. I could not agree more with this, but I can also assure everyone that this is an extremely difficult challenge in a research team environment.
- Measure and optimise ways of working: Great point, but again, very difficult to implement in reality. All of us have to be more inventive in measuring ourselves and the improvements of the team over time.
The second speaker of the day was Dave Willmer, who showed us the latest developments in Jupyter. Jupyter is an interactive data science environment that was created for Python. However, multiple languages are now supported (including Clojure via Clojupyter). Dave showed several improvements, ending with a very impressive (recorded) demo of the equivalent of a fully functional IDE working in the browser, with autocompletion, multiple tabs and windows, and project structure. It was brilliant to see him complaining about the lack of power of current web development technologies and how they have barely achieved the same level of interface sophistication shown by the Xerox research team on a desktop computer back in 1979 (video). He topped this by saying “Web technology is terrible for applications, stick with Python”. I have to admit that I loved this part. Also, Jupyter has been on my list of things to play with for a very long time, and this talk is yet another reminder that I must check it out right now.
After Dave, the lightning talks (5 minutes) started with Alina Solovjova from GoCardless, who showed how to use cartopy to create a cool data visualisation with a minimal amount of code. Cartopy's behaviour is very similar to matplotlib's, and if we add shapefiles to define specific regions on a map (e.g., England's geographical regions), we have a powerful data visualisation kit for geographical data. In her specific case, she obtained the UK regions boundary information from the Office for National Statistics (ONS).
The last talk of the day was given by my good friend Marco Bonzanini. His presentation focused on the use of Luigi to simplify the creation of batch processing pipeline jobs. Everyone who has worked in data science has suffered the pain of setting up, maintaining and debugging such pipelines, and any abstraction that can help with this is welcome. Luigi supports this with the only requirement of redefining three methods for each step in the pipeline: two related to its input and output, and one containing the logic of the step. Marco explains this in much more detail in this blogpost. He also raised a couple of common sense points during his talk that are actually not that common in reality, and I think they are worth mentioning again:
- Log things properly: I could not agree more. This includes logging the steps of any of your models, so that debugging is possible if something goes wrong, as well as separating the standard and error outputs properly. There are very few things that frustrate me as much as a machine learning component randomly spitting text into the standard output with no possibility of silencing it.
- Parameterise everything: You never know which parameters and variables you will want to change, especially if other people will use your code.
- Package your code properly: This one is self-explanatory and I cannot but agree completely.
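These points are easy to illustrate with the standard library alone. The following is a minimal sketch of my own, not code from Marco's talk and not Luigi's API: the step (`keep_long_tokens`), the logger name and the command-line options are all hypothetical. Diagnostics go to the error output through `logging`, so they can be silenced with a flag, while the actual result is the only thing written to the standard output.

```python
import argparse
import logging
import sys

logger = logging.getLogger("pipeline.filter")  # hypothetical logger name


def configure_logging(level_name):
    """Send diagnostics to stderr so stdout stays clean for real output."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(getattr(logging, level_name))


def build_parser():
    """Expose every tunable value as a parameter instead of hard-coding it."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline step")
    parser.add_argument("tokens", nargs="*", help="tokens to filter")
    parser.add_argument("--min-length", type=int, default=3,
                        help="shortest token to keep")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


def keep_long_tokens(tokens, min_length):
    """The step's logic: keep tokens with at least min_length characters."""
    kept = [token for token in tokens if len(token) >= min_length]
    logger.info("kept %d of %d tokens", len(kept), len(tokens))
    return kept


def main(argv=None):
    """Entry point: parse parameters, set up logging, run the step."""
    args = build_parser().parse_args(argv)
    configure_logging(args.log_level)
    kept = keep_long_tokens(args.tokens, args.min_length)
    print(" ".join(kept))  # the result is the only thing sent to stdout
    return kept
```

Calling `main(["--min-length", "4", "--log-level", "ERROR", "data", "is", "fun"])` prints only the kept tokens and logs nothing below the error level. Keeping `main` as a plain function also helps with packaging, since it can be called under the usual `if __name__ == "__main__":` guard or registered as an entry point.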
As I mentioned at the beginning, the meet-up was a great event and I will definitely be back in the future. Also, this is yet another reminder that I should brush up on my Python skills and play with some of the new cool libraries and tools that the community is offering.