I have recently come back from my first ever Python Conference (PyCon) and, in fact, my first ever general software development conference. This was quite a new experience, as I am used to either academic (e.g., ECIR) or data-centric (e.g., Strata) conferences. PyConUK was different in many ways from the events I am used to, and I could not be happier that I attended. The main reason I went was that Marco Bonzanini and I ran a workshop on Natural Language Processing in Python during the conference, but I also saw this as a great opportunity to get involved in a community that I had never been close to, despite having coded in Python (intermittently) for several years.
Reuters-21578 is arguably the most commonly used collection for text classification over the last two decades, and it has been used in some of the most influential papers in the field, for instance, Text Categorization with Support Vector Machines: Learning with Many Relevant Features by Thorsten Joachims. The dataset contains structured information about newswire articles, each of which can be assigned to several classes, which makes this a multi-label problem. It has a highly skewed distribution of documents over categories: a large proportion of documents belong to only a few topics.
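To make the two properties above concrete (multi-label documents and a skewed category distribution), here is a minimal sketch on a tiny made-up corpus. The document ids, the label assignments, and the helper names `label_distribution` and `multi_label_docs` are all assumptions for illustration; they are not real Reuters-21578 documents or counts.

```python
from collections import Counter

# Hypothetical toy corpus: each document id maps to one or more topic labels.
# Illustrative values only, NOT real Reuters-21578 documents or statistics.
toy_corpus = {
    "doc1": ["earn"],
    "doc2": ["earn"],
    "doc3": ["acq"],
    "doc4": ["earn", "acq"],
    "doc5": ["grain", "wheat"],
    "doc6": ["earn"],
}


def label_distribution(corpus):
    """Return the fraction of documents that carry each label."""
    counts = Counter(
        label for labels in corpus.values() for label in set(labels)
    )
    total = len(corpus)
    # most_common() orders labels from most to least frequent.
    return {label: count / total for label, count in counts.most_common()}


def multi_label_docs(corpus):
    """Return the ids of documents assigned to more than one label."""
    return sorted(
        doc_id for doc_id, labels in corpus.items() if len(labels) > 1
    )


if __name__ == "__main__":
    for label, share in label_distribution(toy_corpus).items():
        print(f"{label}: {share:.2f}")
    print("multi-label documents:", multi_label_docs(toy_corpus))
```

Because a document can carry several labels, the per-label fractions do not have to sum to one, which is one reason multi-label collections are usually evaluated per category.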
I have been preparing a couple of talks I have to give in the next couple of weeks, and I needed some pictures of the people working at Signal to have some nice images of the team and the company in general. Although we have some of them stored online, I realised that our Twitter account had some of the best pictures, especially from the early days of the company. Almost at the same time, I was reading a blogpost about mining Twitter data with Python, written by my good friend and ex-colleague (at Queen Mary), Dr. Marco Bonzanini. These two events together seemed like a good excuse to build a little tool in Python to download the pictures that a Twitter account has published, and this is the main focus of this post. I hope you find it useful, I definitely have…
I have spoken before about the Kaggle ecosystem and the Digit recognition challenge, and I have also shown how to improve the original version of the code. However, no quality improvement over the initial solution was attempted. This blogpost focuses exactly on that: What can we do to improve the quality of our results?
In the last blogpost, I focused on a basic piece of functionality that provided a solution for one of the Kaggle challenges using Python. This blogpost shows some improvements to the code itself, as well as to the classification process:
- Removing some custom functionality that was already available in public libraries (pandas)
- Adding logging capabilities
- Including quality evaluation and cross-validation
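As a rough illustration of the last two items, the sketch below shows k-fold cross-validation with a log line per fold, using only the standard library. This is not the code from the original post: the names `k_fold_indices`, `accuracy`, `cross_validate` and the `fit_predict` callback are hypothetical, and the folds are taken in order without shuffling, which a real experiment would usually want to change. In practice a library such as scikit-learn provides this out of the box.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train_indices, test_indices) pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def cross_validate(fit_predict, features, labels, k=5):
    """Average accuracy over k folds.

    fit_predict is an assumed callback with the signature
    fit_predict(train_features, train_labels, test_features) -> predictions.
    """
    scores = []
    for fold, (train, test) in enumerate(k_fold_indices(len(labels), k), 1):
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        score = accuracy(predictions, [labels[i] for i in test])
        logger.info("fold %d/%d accuracy: %.3f", fold, k, score)
        scores.append(score)
    return sum(scores) / len(scores)
```

Any classifier can be plugged in by wrapping its training and prediction steps in a function that matches the `fit_predict` signature.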
I think that every developer should periodically use more than one programming language and more than one programming paradigm, both to stay knowledgeable and to avoid developing a “tunnel vision” that makes us believe some solutions are not possible just because our current paradigm does not support them.
For the last year or so, I have been using mainly one programming language (Clojure). Do not get me wrong, I believe Clojure is the future and I love it as a language, but I do think that being a polyglot developer is something we should all aspire to. Therefore, I have decided to freshen up my Python skills by going back to the basics and using them to solve one Kaggle competition.