Reuters-21578 is arguably the most commonly used collection for text classification over the last two decades, and it has been used in some of the most influential papers in the field, for instance Text Categorization with Support Vector Machines: Learning with Many Relevant Features by Thorsten Joachims. This dataset contains structured information about newswire articles that can be assigned to several classes, which makes this a multi-label problem. It has a highly skewed distribution of documents over categories, where a large proportion of documents belong to a few topics.
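To make the two properties mentioned above more concrete, here is a minimal sketch of how one could inspect them. The corpus below is a tiny made-up example (the document ids, the labels attached to them and the helper names `label_distribution` and `multi_label_share` are all assumptions for illustration, not real Reuters-21578 statistics).

```python
from collections import Counter

# Toy, made-up corpus: each document maps to a *set* of topics (multi-label).
# These are illustrative values only, not real Reuters-21578 counts.
toy_corpus = {
    "doc1": {"earn"},
    "doc2": {"earn"},
    "doc3": {"earn", "acq"},
    "doc4": {"acq"},
    "doc5": {"earn"},
    "doc6": {"grain", "wheat"},
}


def label_distribution(corpus):
    """Count how many documents carry each label, most frequent first."""
    counts = Counter(label for labels in corpus.values() for label in labels)
    return counts.most_common()


def multi_label_share(corpus):
    """Fraction of documents that are assigned to more than one label."""
    return sum(len(labels) > 1 for labels in corpus.values()) / len(corpus)


distribution = label_distribution(toy_corpus)
print(distribution)
print(multi_label_share(toy_corpus))
```

On a skewed collection, the first few entries of `distribution` account for most of the documents, which is why per-category results are usually reported alongside any global average.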
The integration of research ideas in commercial products is one of the most difficult challenges companies are facing in the predictive analytics and data science space. This post presents my thoughts on this topic and explains the S.I.M.P.L.E. approach that we use at Signal to incorporate new models or ideas into our pipeline.
Before going deeper into this topic, I want to say that I know for a fact that I do not have all the answers (nor even all the questions) to this challenge, and there is a lot to be learnt from some of the giants of our sector, such as Google, Netflix, Twitter or Amazon (to name a few), that are mastering R&D integration.
The research and academic community is introducing amazing ideas and new approaches, pushing the limits of human knowledge. However, in the particular case of Text Analytics research, many (if not most) of these ideas are never implemented in a real commercial environment. There are several potential reasons for this:
1. A solution that worked well on small-scale datasets might not be able to scale to the amount of data processed by a specific business in a given period of time. Alternatively, the solution might be able to scale, but at a high cost. Most of the time, models can be scaled up by adding more computational power. However, this could make the solution more expensive than the perceived benefit of using it. For instance, imagine a model that costs twice as much as the current approach and provides a 1% improvement in quality. This approach is not likely to be chosen in a real-life scenario.
2. The problem itself does not need a solution. Just to be clear, I am not saying that research should focus only on problems born in a commercial environment. If we had done that, some of the most important discoveries in history might have been lost. Nonetheless, we have to understand that some research has no real application at the moment of its conception.
3. The solution is difficult to implement or maintain. Although some people like to see algorithms and models as “magic” black boxes, this is usually the opposite of what software engineers and architects need. They need to understand the implementation principles of such libraries in order to do their job better. Unfortunately, this exposes one of the main drawbacks of the academic community: a lot of researchers are not that good at coding. This is illustrated by research libraries which cannot be used in a commercial environment because they lack a minimum level of development quality. This can be seen in poorly defined APIs, lack of multi-threading capabilities, incorrect documentation, outdated technologies, complexity to deploy, …
In addition to these problems, even if a scalable solution of good-enough quality is implemented correctly, the integration of such a module into a production pipeline might present additional challenges of its own.
I am afraid that I have not found a silver bullet to solve these problems. However, we have come up with a process at Signal that has been shown to minimise some of them for us. For every new component added to the system (e.g., machine translation) we follow the S.I.M.P.L.E. steps: Study, Implement, Model, Prepare, Live, and Evolve.
The first step, Study, involves catching up with the latest developments in the research community. Specifically, we focus on the following points:
- How is the problem defined
- How is the quality of a solution measured
- What are the best approaches to address it and how good are they
- Common mistakes to avoid and quick wins (e.g., simple ideas that seem to perform well)
- Current libraries or services
Once we have done the literature review, it is time to implement the first solution to the problem. To do so, we focus on the simplest possible approach that could work. This Proof of Concept (POC) allows us to see the potential benefit of the component for our system. After the POC is completed, a qualitative evaluation is performed by manually inspecting the output of the system on a small sample of our own data.
Assuming that the POC showed some potential, we move into the modelling phase, where a proper framework for the problem is created. The main goal is to allow the automatic evaluation of multiple models on one or more static datasets. This aligns directly with the traditional experimentation setting in most of the related research fields. This phase could also include data collection and annotation. Obviously, the evaluation metrics are modified to fit our business goals. This code should be of a high standard, with good test coverage and design. At the end of this step we should be able to rank and compare different approaches by both effectiveness and efficiency.
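As an illustration of what such a framework boils down to, here is a minimal sketch, not our actual code. It assumes that a model is simply a function from a document to a set of labels, that effectiveness is measured with micro-averaged F1 (in practice the metric would be adapted to the business goal) and that efficiency is wall-clock time. The two toy models, the documents and all function names are hypothetical.

```python
import time


def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of label sets."""
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(models, documents, gold):
    """Run every model on the same static dataset and rank the results."""
    results = []
    for name, predict in models.items():
        start = time.perf_counter()
        predicted = [predict(doc) for doc in documents]
        elapsed = time.perf_counter() - start
        results.append({"model": name,
                        "f1": micro_f1(gold, predicted),
                        "seconds": elapsed})
    # Effectiveness first (higher F1), efficiency as the tie-breaker.
    return sorted(results, key=lambda r: (-r["f1"], r["seconds"]))


# Hypothetical toy models, used only to exercise the harness.
def keyword_model(doc):
    labels = set()
    if "profit" in doc:
        labels.add("earn")
    if "merger" in doc:
        labels.add("acq")
    return labels


def always_earn(doc):
    return {"earn"}


documents = ["profit rises", "merger agreed", "profit after merger"]
gold = [{"earn"}, {"acq"}, {"earn", "acq"}]

ranking = evaluate({"keyword": keyword_model, "always_earn": always_earn},
                   documents, gold)
for row in ranking:
    print(row)
```

Keeping the dataset static and the harness independent of any particular model is what makes the comparison repeatable when a new approach is added later.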
If at least one of the models selected in the previous step is potentially good enough from the business perspective, it is moved into the preparation phase. At this stage, people from research and development refactor the code-base to make it more efficient, elegant and maintainable. They usually increase the test coverage and documentation as well. Moreover, architectural questions related to the integration of the component are also answered in this step: What extra infrastructure do we need? Can we use the current data model? How and where do we store the models? Can we cache some of the data? … These are only a few of the many questions we usually face in the discussion. After all this is done, the model, or the whole framework in some cases, is ready to go “live”.
In the Live step, the development team integrates the framework or selected models into the pipeline and the new version of the system is pushed into a staging environment. Everything is now ready to analyse new data being processed by the new component and to allow a front-end feature to use this information.
The last, but not least, step of the process is to keep evolving and tuning the solutions over time. Topic drift and changes in the environment or the product can have a massive impact on the model, and its quality might therefore decrease. This step requires monitoring tools that inspect the data and the system looking for changes, as well as continued research into new ideas to solve the problem.
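As a rough idea of what the simplest monitoring check could look like, the sketch below compares the distribution of predicted labels in a reference window against a recent window using the total variation distance. This is only one of many possible signals and not a description of our tooling; the function names, the toy windows and the 0.2 threshold are assumptions chosen for illustration.

```python
from collections import Counter


def distribution(labels):
    """Relative frequency of each label in a window of predictions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def has_drifted(reference, recent, threshold=0.2):
    """Flag a window whose label distribution moved beyond an arbitrary threshold."""
    distance = total_variation(distribution(reference), distribution(recent))
    return distance > threshold


# Made-up windows of predicted labels.
reference = ["earn"] * 6 + ["acq"] * 4
recent = ["earn"] * 2 + ["acq"] * 3 + ["crude"] * 5

print(total_variation(distribution(reference), distribution(recent)))
print(has_drifted(reference, recent))
```

A flag raised by a check like this does not say that quality has dropped, only that the data looks different enough to justify re-running the evaluation on fresh annotations.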
We believe that these steps allow an almost seamless integration between the research and development teams and improve our research capabilities while allowing for quick deployment of new solutions or components. The process also helps the cohesiveness and increases the knowledge of the teams by requiring a high degree of interaction between researchers and developers.
I believe that research and development integration is one of the main challenges in innovative companies within the analytics space, and I have presented one of the processes we have at Signal to address it. However, there is no silver bullet and the S.I.M.P.L.E. approach might not work for all companies. In our case, both researchers and developers have embraced it and are quite happy with it.
I have been preparing a couple of talks I have to give in the next couple of weeks and I needed some pictures of the people working at Signal to have some nice images of the team and the company in general. Although we have some of them stored online, I realised that our Twitter account had some of the best pictures, especially from the early days of the company. Almost at the same time, I was reading a blog post about mining Twitter data with Python, written by my good friend and ex-colleague (at Queen Mary), Dr. Marco Bonzanini. These two events together seemed like a good excuse to build a little tool in Python to download the pictures that a Twitter account has published, and this is the main focus of this post. I hope you find it useful, I definitely have…