How to manipulate Kaggle data in Clojure: CSV to EDN parser

My background as a developer for the last 10 years or so (I am getting older…) has been based mainly on Java and Python, with minor incursions into other languages. However, my contact with the functional programming paradigm was limited to my Artificial Intelligence assignments in Lisp at university. Functional programming is now a key part of my life as a developer. I have been using Clojure for most of the last year, and I believe that functional programming in general, and Clojure in particular, are amazing tools for solving several problems for which other paradigms provide inelegant solutions.

Information Retrieval systems are a good example, since different algorithms (i.e., functions), such as different weighting or scoring models, are applied and tested. A more detailed post on Clojure for Text Analytics will come at some point in the future. Until then, I encourage everyone who is curious about it to try 4clojure, a “gamified” way of learning to code in Clojure.

This post shows an example of how to process the CSV input data for the Kaggle digit recognition challenge. The main goal of this task is to automatically identify hand-written numbers.
In particular, this piece of code reads the CSV files (40,000 lines in one of the cases), transforms them into a more usable data structure, and finally serialises this new structure in EDN (a native format for Clojure). The initial CSV format is the following:

label, pixel1, pixel2, … pixel_n
1, 0, 0, 255, …. 0
2, 0, 144, 255, …. 230

The first line defines the headers, including the name of all the features for which we have data (the darkness of each pixel in this specific case). All the subsequent lines define each instance (i.e., images of hand-written numbers) and the darkness value for each one of the pixels in the image. We aim to create a piece of code that can process any csv file with the feature names in the first line and one instance per line, where the first “column” represents the category that instance belongs to, and every other value is a feature weight. An important note is that we will assume (for this example) that all the values are numerical, and that there are no missing values. All the instances have a value for each of the features.
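
The two assumptions above (one value per feature, all values numerical) can be checked before parsing. The following is only a minimal sketch: sample-csv, split-row, numeric? and well-formed? are hypothetical names that are not part of the code in this post, and the naive comma split only works because the file contains nothing but numbers.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample in the layout described above: a header row,
;; then one instance per line with the category in the first column.
(def sample-csv
  "label,pixel0,pixel1,pixel2\n1,0,0,255\n2,0,144,230")

(defn split-row
  "Naive comma split; acceptable for purely numerical files,
   not for CSV files with quoted fields."
  [row]
  (mapv str/trim (str/split row #",")))

(defn numeric?
  "True when s looks like an integer or a decimal number."
  [s]
  (boolean (re-matches #"-?\d+(\.\d+)?" s)))

(defn well-formed?
  "True when every row has one value per header column
   and every value is numerical."
  [[header & rows]]
  (every? (fn [row]
            (and (= (count row) (count header))
                 (every? numeric? row)))
          rows))

(def parsed (map split-row (str/split-lines sample-csv)))

(well-formed? parsed)
```

A row with a missing or non-numerical value makes well-formed? return false, which is a cheap way to reject a file before spending time on the full conversion.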

The first part of the code focuses on parsing the input files with the function csv-collection-to-edn. This function writes the result of parsing the CSV file to an EDN file using spit. The parsing of the individual lines depends on the format of the file (with or without category information in the first position of the row). The training file has labelled data, while the test file does not.


(defn csv-collection-to-edn
  "Dumps the csv test collection into an edn file using a
   splitter function. If labelled? is true, the first element
   of each row is considered to be the category information."
  [csv-file edn-file labelled?]
  (if labelled?
     (spit edn-file (pr-str
         (parse-feature-csv csv-file labelled-line-splitter)))
     (spit edn-file (pr-str
         (parse-feature-csv csv-file unlabelled-line-splitter)))))
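
To see what the spit and pr-str combination produces, here is a small round trip that writes a structure to a temporary file and reads it back with clojure.edn. It is only a sketch: the instances vector is made-up data with the same shape as the labelled output, and the temporary file stands in for the real EDN file.

```clojure
(require '[clojure.edn :as edn])
(import 'java.io.File)

;; Hypothetical parsed instances: [label sparse-feature-map] pairs.
(def instances
  [["1" {"pixel2" 255}]
   ["2" {"pixel1" 144, "pixel2" 230}]])

;; Temporary file used only for this illustration.
(def edn-file (File/createTempFile "digits" ".edn"))

;; pr-str renders the data as readable EDN text; spit writes it out.
(spit edn-file (pr-str instances))

;; slurp + edn/read-string recovers the original structure.
(def restored (edn/read-string (slurp edn-file)))

(= instances restored)
```

Reading the file back with clojure.edn/read-string (rather than the core read-string) is the usual choice for data files, since it only reads data and never evaluates code.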

The next functions to show are parse-feature-csv and sparse-feature-map, which are responsible for transforming the CSV file into sparse hash-maps that hold, for each instance, only the features with a non-zero value. The generic CSV parsing is done with the clojure.data.csv library, and the map for each line is built with zipmap (which creates a hash-map using the elements of the first sequence as keys for the values in the second). This structure holds all the features of an instance. However, the data is very sparse, and storing so many zero values is a waste of space. For this reason, the function applies remove to drop every entry in the map with a value of zero. Also, sparse-feature-map illustrates one nice feature of Clojure, preconditions:

{:pre [(= (count features) (count values))]}

The semantics of this (very compact) line is that the number of features and the number of values must be the same for the function to work; otherwise, it will throw an AssertionError.
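
Both ideas (zipmap followed by remove, and the precondition) can be tried at the REPL in isolation. This is a minimal sketch with a hypothetical helper, sparse-pairs, that takes values that are already numbers instead of strings read from a CSV file.

```clojure
(defn sparse-pairs
  "Hypothetical helper: pairs features with values and drops the
   entries whose value is zero. Assumes values are already numbers."
  [features values]
  {:pre [(= (count features) (count values))]}
  (into {} (remove (comp zero? val)
                   (zipmap features values))))

;; Only the non-zero pixels survive.
(def sparse
  (sparse-pairs ["pixel0" "pixel1" "pixel2"] [0 144 255]))

;; Two features but one value: the precondition fails
;; before the body of the function runs.
(def mismatch
  (try
    (sparse-pairs ["pixel0" "pixel1"] [0])
    (catch AssertionError _ :precondition-failed)))
```

Here sparse ends up holding only pixel1 and pixel2, and mismatch holds the keyword returned by the catch clause. Note that preconditions are only checked while *assert* is true, which is the default.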


(ns kaggle.digits-recognition.parse-collection
  (:require [clojure.java.io :as io]
           [clojure.data.csv :as csv]))

(defn sparse-feature-map
  "Create a sparse (i.e., removing any features with value 0)
   hash-map representation based on a set of features. The value
   of all features are assumed to be numerical and
  no missing features are allowed."
  [features values]
  {:pre [(= (count features) (count values))]}
  (into {} (remove (comp zero? val)
        (zipmap features (map read-string values)))))

(defn labelled-line-splitter
  "Returns the category of an instance and its hashmap with features
   and weights given the header and the sequence of elements in a
   line representing a train example. The first column holds the
   category, so it is dropped from both the header and the line."
  [header line]
  (vector (first line) (sparse-feature-map (rest header) (rest line))))

(defn unlabelled-line-splitter
  "Returns the hashmap of features and weights given the header and
   the sequence of elements in a line representing a test example.
   There is no category column, so every column is a feature."
  [header line]
  (sparse-feature-map header line))

(defn parse-feature-csv
  "Processes a csv file and returns the seq of parsed instances."
  [csv-file-name line-splitter]
  (with-open [in-file (io/reader csv-file-name)]
    (let [csv-content (csv/read-csv in-file)
          header (first csv-content)]
      (->> (rest csv-content)
           ;; pmap is lazy, so doall forces it before the reader closes
           (pmap (partial line-splitter header))
           doall))))

(defn csv-collection-to-edn
  "Dumps the csv test collection into an edn file using a
   splitter function."
  [csv-file edn-file labelled?]
  (if labelled?
    (spit edn-file (pr-str
      (parse-feature-csv csv-file labelled-line-splitter)))
    (spit edn-file (pr-str
      (parse-feature-csv csv-file unlabelled-line-splitter)))))

This post is not intended to be a Clojure lesson, but a very simple, and definitely imperfect, example of what Clojure can offer, especially in the data science space. I promise to write a more extensive and more advanced post in the near future. All the code required for the conversion is shown above. I will probably upload it to GitHub soon as well, once I extend the functionality a bit more.
