Thinking out loud about my new favourite toys: Clojure (a functional programming language) and Incanter (an R-like statistical platform built on Clojure). My first impression is that Incanter is usable, but as a new platform it is far from being as polished as R. I am using Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance as examples to perform a simple correlation test. Some Incanter functions are rough around the edges and leave a lot to be desired. This is in sharp contrast to R, whose plug-in modules are both impressive and comprehensive. On the other hand, making things happen with Clojure is ... fun. And that's a winner for me.

Here's a walkthrough of my first attempt at using Incanter to analyse stock data. Figures 1 and 2 below are basic time series plots of GDX and GLD. I'll need to study the Incanter source code to figure out how to plot multiple time series on the same chart; I couldn't find it at first glance.

[caption id="attachment_5535" align="aligncenter" width="488" caption="GDX"][/caption]
[caption id="attachment_5536" align="aligncenter" width="488" caption="GLD"][/caption]

The following looks like a lot of code just to plot two graphs. Most of it is boilerplate to coerce the Yahoo Finance CSV data into the form that Incanter's time-series-plot function expects. I will wrap this into a library later on.
(ns inc-sandbox.corr-demo
  (:require [clojure.set :as set]
            [clj-time.core :as time])
  (:use (incanter core stats charts io)
        (clj-time [format :only (formatter formatters parse)]
                  [coerce :only (to-long)])))

(defn sym-to-dataset
  "returns a dataset read from a local CSV in './data/' given a Yahoo Finance symbol name"
  [yf-symbol]
  (let [+data "./data/"
        +csv ".csv"
        symbol (.toUpperCase yf-symbol)
        filename (str +data symbol +csv)]
    (-> (read-dataset filename :header true)
        (col-names [:Date :Open :High :Low :Close :Volume :Adj-Close]))))

(def gdx (sym-to-dataset "GDX"))
(def gld (sym-to-dataset "GLD"))

(defn same-dates?
  "are two datasets covering the same time frame?"
  [x y]
  (let [x-dates (into #{} ($ :Date x))
        y-dates (into #{} ($ :Date y))
        x-y (clojure.set/difference x-dates y-dates)
        y-x (clojure.set/difference y-dates x-dates)]
    (and (empty? x-y) (empty? y-x))))

(same-dates? gdx gld) ; true

(def gdx-ac ($ :Adj-Close gdx))
(def gld-ac ($ :Adj-Close gld))

(defn dates-long
  "returns the dates as long"
  [data]
  (let [ymd-formatter (formatters :year-month-day)
        dates-str ($ :Date data)]
    (map #(to-long (parse ymd-formatter %)) dates-str)))
;; no replace col func

(def gdx-times (dates-long gdx))
(def gld-times (dates-long gld))

(view (time-series-plot gld-times gld-ac :x-label "Date" :y-label "GLD"))
(view (time-series-plot gdx-times gdx-ac :x-label "Date" :y-label "GDX"))
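Part of that boilerplate could probably shrink. As a rough sketch only, and not what the listing above does, the two helpers below use plain Clojure and the JDK's java.time classes (Java 8 or later) instead of clj-time. The names `date->millis` and `same-elements?` are made up for illustration, and the sketch assumes the Yahoo Finance dates are ISO "yyyy-MM-dd" strings interpreted as midnight UTC, which is what the `:year-month-day` formatter suggests.

```clojure
(import '(java.time LocalDate ZoneOffset))

(defn date->millis
  "Hypothetical helper: parses an ISO 'yyyy-MM-dd' string and returns
  epoch milliseconds at midnight UTC."
  [s]
  (-> (LocalDate/parse s)
      (.atStartOfDay ZoneOffset/UTC)
      (.toInstant)
      (.toEpochMilli)))

(defn same-elements?
  "Hypothetical helper: true when both collections hold the same distinct
  values. Set equality replaces the two set differences."
  [xs ys]
  (= (set xs) (set ys)))
```

With these, something like `(map date->millis ($ :Date gdx))` would stand in for `dates-long`, and `(same-elements? ($ :Date gdx) ($ :Date gld))` for `same-dates?`, assuming the `:Date` column holds strings in that format.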
Calculating Pearson's and Spearman's correlation coefficients is straightforward enough: each is just a function call away. I am surprised to see Spearman's rho implemented in Incanter already, as its non-parametric statistics library is practically non-existent. Yet another project to work on. However, calculating the coefficients is only half the story. Where are the p-values? They don't seem to be available for these functions. The t-test function is a good example of the results I would like to see. A third item to work on.
(correlation gdx-ac gld-ac)   ; 0.7906494552829249
(spearmans-rho gdx-ac gld-ac) ; 0.7728859703262337
;; no p-value in col, like t-test

(let [lm (linear-model gdx-ac gld-ac)]
  (doto (scatter-plot gld-ac gdx-ac :x-label "GLD" :y-label "GDX")
    (add-lines gld-ac (:fitted lm))
    view))
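On the missing p-values: for Pearson's r, the usual test of "no correlation" converts r into a t statistic with n - 2 degrees of freedom, t = r * sqrt((n - 2) / (1 - r^2)). The sketch below is plain Clojure and does not use Incanter; `pearson-r` and `r->t` are hypothetical names chosen for illustration. The two-sided p-value would then be 2 * (1 - F(|t|)), where F is the Student's t CDF with n - 2 degrees of freedom. Incanter's stats namespace may offer that CDF (I believe as `cdf-t` with a `:df` option), but treat that as an assumption to check against the API docs.

```clojure
(defn pearson-r
  "Hypothetical helper: Pearson correlation of two equal-length
  numeric sequences."
  [xs ys]
  (let [n    (count xs)
        mean (fn [v] (/ (reduce + v) (double n)))
        mx   (mean xs)
        my   (mean ys)
        dx   (map #(- % mx) xs)          ; deviations from the mean
        dy   (map #(- % my) ys)
        sxy  (reduce + (map * dx dy))    ; sum of cross products
        sxx  (reduce + (map * dx dx))    ; sums of squares
        syy  (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (* sxx syy)))))

(defn r->t
  "Hypothetical helper: t statistic for H0 'rho = 0', to be compared
  against a t distribution with n - 2 degrees of freedom."
  [r n]
  (* r (Math/sqrt (/ (- n 2) (- 1.0 (* r r))))))
```

For example, `(r->t (pearson-r gdx-ac gld-ac) (count gdx-ac))` would give the statistic for the GDX and GLD series above.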
[caption id="attachment_5543" align="aligncenter" width="488" caption="GLD-GDX scatter plot"][/caption]

Even though Incanter is still early in its development, it is certainly a usable statistical platform offering many of the basics. I look forward to learning more about it and contributing to the project. Now, which of the listed problems should I tackle first?

P.S. My code embed seems to be mysteriously breaking my code. As an alternative, the complete source is available on Gist.