Thinking out loud about my new favourite toys: Clojure (a functional
programming language) and Incanter (an R-like statistical platform built
on Clojure). My first impression is that Incanter is usable but, being a
new platform, far from as polished as R. As an example, I am using
the Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance
to perform a simple correlation test. Some Incanter
functions are rough around the edges and leave a lot to be desired. This
is in sharp contrast to R, whose plug-in modules are both
impressive and comprehensive. On the other hand, making things happen
with Clojure is ... fun. And that's a winner for me. Here's a
walkthrough of my first attempt at using Incanter to analyse stock data.

Figures 1 and 2 below are basic time series plots of GDX and GLD. I'll
need to study the Incanter source code to figure out how to plot
multiple time series on the same chart; I couldn't find it at first
glance.

[caption id="attachment_5535" align="aligncenter" width="488" caption="GDX"][/caption]

[caption id="attachment_5536" align="aligncenter" width="488" caption="GLD"][/caption]

The following looks like a lot of code just to plot two graphs. Most of the
boilerplate is there to coerce the Yahoo Finance CSV data into the shape
that Incanter's time-series-plot function expects. I will wrap this into a
library later on.

    (ns inc-sandbox.corr-demo
      (:require [clojure.set :as set]
                [clj-time.core :as time])
      (:use (incanter core stats charts io)
            (clj-time [format :only (formatter formatters parse)]
                      [coerce :only (to-long)])))

    (defn sym-to-dataset
      "returns a dataset read from a local CSV in './data/' given a Yahoo
      Finance symbol name"
      [yf-symbol]
      (let [+data "./data/"
            +csv ".csv"
            symbol (.toUpperCase yf-symbol)
            filename (str +data symbol +csv)]
        (-> (read-dataset filename :header true)
            (col-names [:Date :Open :High :Low :Close :Volume :Adj-Close]))))

    (def gdx (sym-to-dataset "GDX"))
    (def gld (sym-to-dataset "GLD"))

    (defn same-dates?
      "are two datasets covering the same time frame?"
      [x y]
      (let [x-dates (into #{} ($ :Date x))
            y-dates (into #{} ($ :Date y))
            x-y (clojure.set/difference x-dates y-dates)
            y-x (clojure.set/difference y-dates x-dates)]
        (and (empty? x-y) (empty? y-x))))

    (same-dates? gdx gld) ; true

    (def gdx-ac ($ :Adj-Close gdx))
    (def gld-ac ($ :Adj-Close gld))

    (defn dates-long
      "returns the dates as long"
      [data]
      (let [ymd-formatter (formatters :year-month-day)
            dates-str ($ :Date data)]
        (map #(to-long (parse ymd-formatter %)) dates-str)))

    ;; no replace col func
    (def gdx-times (dates-long gdx))
    (def gld-times (dates-long gld))

    (view (time-series-plot gld-times gld-ac :x-label "Date" :y-label "GLD"))
    (view (time-series-plot gdx-times gdx-ac :x-label "Date" :y-label "GDX"))

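
Stripped of the library calls, the date coercion above turns a yyyy-MM-dd string into epoch milliseconds, which is what gets passed as the x values to time-series-plot. Here is a minimal dependency-free sketch of that one step, assuming Java 8+ for java.time. It is not the clj-time code used above, and `ymd->millis` is a hypothetical helper name.

```clojure
(import '(java.time LocalDate))

(defn ymd->millis
  "Hypothetical helper: parses a yyyy-MM-dd string and returns epoch
  milliseconds at midnight UTC."
  [s]
  ;; toEpochDay counts whole days since 1970-01-01;
  ;; one day is 86,400,000 milliseconds.
  (* 86400000 (.toEpochDay (LocalDate/parse s))))

(map ymd->millis ["1970-01-01" "1970-01-02"])
;; => (0 86400000)
```

Mapping this over the :Date column would play the same role as dates-long, as long as the CSV dates really are in yyyy-MM-dd form.
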
Calculating Pearson's and Spearman's correlation coefficients is straightforward enough: each is just a function call away. I am surprised to see Spearman's rho already implemented in Incanter, since its non-parametric statistics library is otherwise practically non-existent. Yet another project to work on. However, calculating the coefficients is only half the story. Where are the p-values? They don't seem to be available from these functions. The t-test function is a good example of the kind of results I would like to see. A third item to work on.

    (correlation gdx-ac gld-ac)
    ; 0.7906494552829249

    (spearmans-rho gdx-ac gld-ac)
    ; 0.7728859703262337
    ;; no p-value in col, like t-test

    (let [lm (linear-model gdx-ac gld-ac)]
      (doto (scatter-plot gld-ac gdx-ac :x-label "GLD" :y-label "GDX")
        (add-lines gld-ac (:fitted lm))
        view))

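
As a rough sketch of what the missing p-value would involve for Pearson's r (this is not an Incanter API): under the null hypothesis of zero correlation, a coefficient r computed from n pairs maps to a t statistic with n - 2 degrees of freedom. `correlation-t-stat` is a hypothetical helper name and uses nothing beyond core Clojure.

```clojure
(defn correlation-t-stat
  "Hypothetical helper: t statistic for testing Pearson's r = 0,
  given the coefficient r and the number of paired observations n.
  t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."
  [r n]
  (* r (Math/sqrt (/ (- n 2) (- 1.0 (* r r))))))

;; Toy numbers for illustration only, not the GDX/GLD result:
(correlation-t-stat 0.5 5)
;; => 1.0
```

The two-sided p-value is then twice the upper tail of a Student's t distribution at |t|. Incanter's cdf-t looks like the natural fit for that last step, assuming it takes the degrees of freedom as a :df option; treat that part as an assumption rather than a checked call.
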
[caption id="attachment_5543" align="aligncenter" width="488" caption="GLD-GDX scatter plot"][/caption]

Even though Incanter is still early in
its development, it is certainly a usable statistical platform offering
many of the basics. I look forward to learning more about it and
contributing to the project. Now, which of the listed problems should I
tackle first?

P.S. My code embed seems to be mysteriously breaking my
code. As an alternative, the complete source is available on Gist.