Unlock Lisp sorcery in your data structure by implementing Clojure ISeq

People that has gone through The Little Schemer might not find this exciting. One of the things that I discovered while patching Clatrix is that implementing clojure.lang.ISeq interface in your custom data structure unlocks the magic of Lisp composition. By enabling primative operators such as first, next, more, cons, higher-level operations such as map and reduce would just work when operating on your data structure. I find it fascinating that a native Fortran matrix object (through jBLAS) can be made clojury with a few magic operations implemented.

However, getting a deftype implementation of Matrix correct took some effort as these operators are not as simple as they seem.

public interface ISeq extends IPersistentCollection {
    Object first();
    ISeq next();
    ISeq more();
    ISeq cons(Object o);
}

For example, say we have a matrix M like so.

=> (def M (matrix [[1 2 3] [4 5 6] [7 8 9]]))
=> M
A 3x3 matrix
-------------
1.00e+00  2.00e+00  3.00e+00 
4.00e+00  5.00e+00  6.00e+00 
7.00e+00  8.00e+00  9.00e+00

Reducing M across its maps is equivalent to a column-wise operation.

=> (reduce #(map + %1 %2) M)
(12.0 15.0 18.0)

Yet for a while this doesn't work because I wasn't careful on my implementation of first.

Consider the case of a 2x2 matrix. A 2x2 matrix is structurally equivalent to a nested vector. Calling first on these would yield:

=> (first [[1 2] [3 4]])
[1 2]
=> (first (matrix [[1 2] [3 4]]))
A 1x2 matrix
-------------
1.00e+00  2.00e+00

And for a 3x1 vector matrix, i.e. one-dimensional, it is equivalent to a regular vector.

=> (first [1 2 3])
1
=> (first (matrix [1 2 3]))
1.0

But here's a tricky bit. What happens during reduce as it keeps recurring next and first?

Let's step this through for (reduce #(map + %1 %2) M). %1 is basically the result so far and %2 is the first of the remaining collection to be processed.

iterationaccumulated (%1)first (%2)remaining
0nil[1 2 3][[1 2 3] [4 5 6] [7 8 9]]
1[1 2 3][4 5 6][[4 5 6] [7 8 9]]
2, bad[5 7 9]7[[7 8 9]]
2, good[5 7 9][7 8 9][[7 8 9]]

The problem arises in the second iteration. Calling (rest [[4 5 6] [7 8 9]]) returns [[7 8 9]]. However, (matrix [[7 8 9]]) is a row vector and (matrix [7 8 9]) is a column vector. Both are considered one dimensional. In either case, first of a vector should return the first element, which is a number. Thus at this iteration, reduce breaks because you can't map a sequence with a number, (map + [5 7 9] 7), to get an accumulated value.

What we want though, is for the second iteration to return [7 8 9] instead because the original matrix is not a vector. Luckily, this particular problem has been solved by my colleague Antonio Garrote when he did this in Java a year ago by keeping a predicate field signifying is this matrix supposed to be vector or not.

So there you have it. If you find yourself needing to implement deftype to build your own data structure in Clojure. Do consider implementing clojure.lang.ISeq to leverage high-level Clojure functions but be careful about those seemingly simplistic primitive operators.

A hypothetical data analysis platform

My definition of a statistical platform is that it is a glue that ties orthogonal data analysis functions together. Take R for instance, it is a platform-as-application. You fire up R and everything is accessible to you. However, all the packages only work on top of R.

Python, on the other hand, take a platform-as-libraries approach. A basic data analaysis setup is to pip install Numpy, Scipy, Matplotlib. High-level libraries, such as scikit-learn and pandas, are built on top of these. It is somewhat more flexible for picking and choosing but the dependency is still a tree-like structure between some packages.

Then there's Incanter.

You don't like to use Parallel Colt for your matrices? Here, try this BLAS drop-in replacement and everything would just work with 10x speed.

Much of this flexibility is due to earlier design choices by Liebke et al. to leverage Clojure's idiom that "it is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures."

The thing is, I think we're only scratching the surface. Excuse me while I dream for a minute.

Say instead of jBLAS, you want to use CPU/GPU hybrid instead. Suppose you can just do a (use 'incanter-magma) and your Incanter code would just run with MAGMA (via Mathew Rocklin) under the hood without any other change.

Taking this idea of interfacing libraries into a hypothetical use case. Imagine that you cleaned and structured your data on Hadoop using Cascalog and is looking to analyse this dataset. You start your Incanter session to pull in your data (use 'incanter-cascalog). Write some Incanter script to interrogate this dataset but find the data is still too big for your laptop. So you (use 'incanter-storm) to make use of distributed processing instead. Incanter would then flow data directly from Cascalog to Storm inside your cluster.

For your results, you find JFreeChart limiting so you (use 'incanter-c2) to spiff up your visualisations with C2 all while not changing a single line of your Incanter script.

Instead of the star-like dependency of R and its packages, or the tree-like structure for Python and its packages, Incanter could be an interface to stand-alone libraries encapsulated by an application for the user.

Incanter, the library, could be modules that transform data into standard Incanter-compatible data structures to and from external libraries. Incanter, the application, could be a domain specific language, a client, and a in-REPL package manager.

Another benefit to this is that it helps to mitigate the developer shortage problem for Incanter too by making use of external, stand-alone libraries.

I call this platform-as-interface.

Vector algorithm using tree composition

Sniffed this trick from the Incanter source. Here's a demo of using tree composition to calculate a 2-d Euclidean distance between two points.

(def x [1 2]) 
(def y [4 5]) 
(defn- tree-comp-each [root branch & leaves]
    (apply root (map branch leaves))) 
(defn euclidean-distance [a b]
    {:pre [(= (count a) (count b))]} 
    (sqrt (apply tree-comp-each + 
                (fn [[x y]]
                    (pow (- x y) 2)) 
                (map vector a b))))

I've only had exposures to tree traversal in the use of implementing searching and sorting algorithms. This trick here definitely widened my eyes to the wonders of functional programming. It's more than just being able to pass and manipulate functions. I need to think like a tree.

First impression of Incanter: usably incomplete

Speaking out loud about my new favourite toys, Clojure (a functional programming language) and Incanter (a R-like statistical platform built on Clojure). My first impression is that Incanter is usable but it is far from being polished like R since it is a new platform. I am using the Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance as examples to perform a simple correlation test. Some Incanter functions are rough on the edges and a lot is left to be desired. This is a drastic contrast to R in which its plug-in modules are both impressive and comprehensive. On the other hand, making things happen with Clojure is ... fun. And that's a winner for me. Here's a walk through of my first attempt using Incanter to analyse stocks data. Figures 1 and 2 below are basic time series plot of GDX and GLD. I'll need to study the Incanter source code to figure out how to plot multiple time series on the same chart. Couldn't find it on a first glimpse though. [caption id="attachment_5535" align="aligncenter" width="488" caption="GDX"][/caption] [caption id="attachment_5536" align="aligncenter" width="488" caption="GLD"][/caption] The following looks like a lot of code just to plot two graphs. Most of the boilerplate code is to coerce the Yahoo Finance CSV data to specifically fit Incanter's time-series-plot function. I will wrap this into a library later on.

(ns inc-sandbox.corr-demo (:require
[clojure.set :as set]) (:use (incanter core stats charts io)) (:require
[clj-time.core :as time]) (:use (clj-time [format :only (formatter
formatters parse)] [coerce :only (to-long)]))) (defn sym-to-dataset
"returns a dataset read from a local CSV in './data/' given a Yahoo
Finance symbol name" [yf-symbol] (let [+data "./data/" +csv ".csv"
symbol (.toUpperCase yf-symbol) filename (str +data symbol +csv)] (-\>
(read-dataset filename :header true) (col-names [:Date :Open :High :Low
:Close :Volume :Adj-Close])))) (def gdx (sym-to-dataset "GDX")) (def gld
(sym-to-dataset "GLD")) (defn same-dates? "are two datasets covering the
same time frame?" [x y] (let [x-dates (into \#{} (\$ :Date x)) y-dates
(into \#{} (\$ :Date y)) x-y (clojure.set/difference x-dates y-dates)
y-x (clojure.set/difference y-dates x-dates)] (and (empty? x-y) (empty?
y-x)))) (same-dates? gdx gld) ; true (def gdx-ac (\$ :Adj-Close gdx))
(def gld-ac (\$ :Adj-Close gld)) (defn dates-long "returns the dates as
long" [data] (let [ymd-formatter (formatters :year-month-day) dates-str
(\$ :Date data)] (map \#(to-long (parse ymd-formatter %)) dates-str)))
;; no replace col func (def gdx-times (dates-long gdx)) (def gld-times
(dates-long gld)) (view (time-series-plot gld-times gld-ac :x-label
"Date" :y-label "GLD")) (view (time-series-plot gdx-times gdx-ac
:x-label "Date" :y-label "GDX"))

Calculating the Pearson and the Spearman's correlation coefficient are straightforward enough. It's just a function call away. I am surprised to see Spearman's rho implemented in Incanter already as its non-parametric statistics library is practically non-existent. Yet another project to work on. However, calculating the coefficients is only half the story. Where are the p-values? That doesn't seem to be available for these functions. The t-test function is a good example of what results I would like to see. A third item to work on.

 (correlation gdx-ac gld-ac)
; 0.7906494552829249 (spearmans-rho gdx-ac gld-ac) ; 0.7728859703262337
;; no p-value in col, like t-test (let [lm (linear-model gdx-ac gld-ac)]
(doto (scatter-plot gld-ac gdx-ac :x-label "GLD" :y-label "GDX")
(add-lines gld-ac (:fitted lm)) view))

[caption id="attachment_5543" align="aligncenter" width="488" caption="GLD-GDX scatter plot"][/caption] Even though Incanter is still early in its development, it is certainly a usable statistical platform offering many of the basics. I look forward to learning more about it and contributing to the project. Now, which of the listed problems should I tackle first? P.S. My code embed seem to be mysteriously breaking my code. As an alternative, the complete source is available on Gist.