Local Hadoop test cluster up and running
Thanks to Cloudera’s CDH3 image, I have a virtual machine with Hadoop on CentOS 5 working. I’m more of an Ubuntu guy, so CentOS is new to me. But nothing Google couldn’t solve.
I also ran into a Hadoop exception about the Java heap space. I couldn’t find a solution online, so I just bumped up the memory on the virtual machine and that solved the problem.
In any case, I managed to run the pi calculation example on my local Hadoop cluster.
Building a distributed back-tester with Hadoop on Amazon AWS
Testing is arguably the single most important aspect of trading system development. You can’t tell how well an idea works unless you test it out. Testing can also help you identify weaknesses or strengths in your model. The downside to testing is that it takes time to churn through those gigabytes of data.
Backtesting is inherently a linear process. You feed your tick data into your algorithm and expect some actions out. You can’t really make use of fork/join to let other threads steal from the process queue, because later steps depend on the results of earlier calculations. However, more often than not, you’re interested in testing many variations of a strategy. This is where MapReduce comes into play.
MapReduce is a software framework from Google. It is inspired by the map and reduce functions ubiquitous in functional programming, where they are as common as for-loops are in the Java world.
The map function partitions an input into smaller problems and runs them concurrently, e.g. each variant of the strategy is executed on a node.
The reduce function takes the results from all the nodes and aggregates them into an output, e.g. the back-test results from every strategy variant.
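To make the two steps concrete, here is a minimal single-machine sketch in plain Clojure. Everything in it is made up for illustration: the price series, the toy `back-test` rule, and the `variants` thresholds are hypothetical, and nothing here touches Hadoop itself. It only shows the shape of the idea: one independent back-test per variant (map), then one fold over the results (reduce).

```clojure
;; Hypothetical data and strategy, for illustration only.
(def prices [100 102 99 103 104 102 105])

(defn back-test
  "Toy rule: if the last bar rose by more than `threshold`, hold for the
  next bar. Returns the variant together with its total P&L."
  [prices threshold]
  (let [bars (partition 3 1 prices)   ; sliding windows of three prices
        pnl  (reduce + (for [[p0 p1 p2] bars
                             :when (> (- p1 p0) threshold)]
                         (- p2 p1)))]
    {:threshold threshold :pnl pnl}))

;; The strategy variants to compare (hypothetical thresholds).
(def variants [0 1 3])

;; map step: every variant is back-tested independently of the others.
;; On a multi-core machine `pmap` could replace `map` here; on a cluster
;; each call would roughly correspond to one map task.
(def results (map #(back-test prices %) variants))

;; reduce step: fold the per-variant results into a single answer,
;; here the variant with the highest P&L.
(def best
  (reduce (fn [a b] (if (> (:pnl b) (:pnl a)) b a)) results))
```

Each back-test is still linear inside, which is fine: the parallelism comes from running many of them side by side.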
Having used functional programming for some time now, using map/reduce is very natural for me. Where my knowledge falls short is in implementing a distributed infrastructure for running these map and reduce steps at a scale far beyond my own multi-core computer.
It just so happens that Amazon AWS has a hosted Hadoop PaaS, Hadoop being Apache’s framework for MapReduce. Hardware, check. Framework, check. This will be the second system that I’ll be working on in my goal to build a complete trading R&D platform.
Expect some technical discussions in the coming months as I work my way through. Now, where should I start…
Towards a broker-agnostic trading system
The first generation of my trading system uses a broker’s proprietary script on that same broker’s platform. The second generation of my trading system uses a generic programming language on another broker’s platform. I am now working on the third generation of my trading system. It is language-agnostic, platform-agnostic, and broker-agnostic.
Why? Because designing trading systems is as much about creativity as it is about science. An artist would not limit themselves to a defined toolset. I want to be free to use whatever tools are necessary to get the job done efficiently, and to be able to try new things as technologies evolve.
Having a broker-agnostic system frees me from being locked into any particular broker. It also means I could use one system to gather data from, and trade with, multiple brokers for the best execution prices.
Yes, this is over-engineering for a small entity like myself. But heck, this is something that I enjoy building. So why not?
Vector algorithm using tree composition
Sniffed this trick from the Incanter source. Here’s a demo of using tree composition to calculate a 2-d Euclidean distance between two points.
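;; sqrt and pow are assumed to come from incanter.core
(use '(incanter core))

(def x [1 2])
(def y [4 5])

(defn- tree-comp-each
  "applies root to the results of mapping branch over the leaves"
  [root branch & leaves]
  (apply root (map branch leaves)))

(defn euclidean-distance
  [a b]
  {:pre [(= (count a) (count b))]}
  (sqrt
    (apply
      tree-comp-each
      +
      (fn [[x y]]
        (pow (- x y) 2))
      (map vector a b))))

(euclidean-distance x y) ; square root of 18, roughly 4.243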
My only exposure to tree traversal had been in implementing searching and sorting algorithms. This trick definitely opened my eyes to the wonders of functional programming. It’s about more than just being able to pass and manipulate functions.
I need to think like a tree.
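One way to see why the tree view pays off: swap the root or the branch and a different distance falls out of the same skeleton. The following is a minimal sketch in plain Clojure, with no Incanter dependency. The helper names `distance-with` and `abs-diff` are my own for illustration and do not come from the Incanter source.

```clojure
(defn tree-comp-each
  "Applies root to the results of mapping branch over each leaf."
  [root branch & leaves]
  (apply root (map branch leaves)))

(defn distance-with
  "Generic distance: branch scores each coordinate pair, root combines
  the scores. (Hypothetical helper, not part of Incanter.)"
  [root branch a b]
  {:pre [(= (count a) (count b))]}
  (apply tree-comp-each root branch (map vector a b)))

(defn- abs-diff
  "Absolute difference of a coordinate pair."
  [[x y]]
  (let [d (- x y)]
    (if (neg? d) (- d) d)))

;; Same tree, different root/branch functions.
(defn manhattan-distance [a b]
  (distance-with + abs-diff a b))

(defn chebyshev-distance [a b]
  (distance-with max abs-diff a b))

(defn euclidean-distance [a b]
  (Math/sqrt (distance-with + (fn [[x y]] (Math/pow (- x y) 2)) a b)))

(manhattan-distance [1 2] [4 5]) ; 3 + 3 = 6
(chebyshev-distance [1 2] [4 5]) ; max of 3 and 3 = 3
```

The leaves (coordinate pairs) stay the same in all three; only the functions at the branch and root levels change.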
Back in R&D mode
Just some updates as to what I’m up to. I’ve been fully immersed in research and development mode for some new systems and technologies. This could take a few months.
I am also deliberating whether I should revert Quantisan.com to a pure blog again and move QTD to its own domain. Quantisan.com has too many random topics and too much history, which makes it difficult to keep a consistent message as a company blog.
First impression of Incanter: usably incomplete
Speaking out loud about my new favourite toys, Clojure (a functional programming language) and Incanter (an R-like statistical platform built on Clojure). My first impression is that Incanter is usable but, being a new platform, far from as polished as R. I am using the Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance as examples to perform a simple correlation test. Some Incanter functions are rough around the edges and leave a lot to be desired. This is in stark contrast to R, whose plug-in modules are both impressive and comprehensive. On the other hand, making things happen with Clojure is … fun. And that’s a winner for me.
Here’s a walkthrough of my first attempt at using Incanter to analyse stock data.
Figures 1 and 2 below are basic time series plots of GDX and GLD. I’ll need to study the Incanter source code to figure out how to plot multiple time series on the same chart. I couldn’t find it at first glance though.

Figure 1: GDX

Figure 2: GLD
The following looks like a lot of code just to plot two graphs. Most of the boilerplate code is to coerce the Yahoo Finance CSV data to specifically fit Incanter’s time-series-plot function. I will wrap this into a library later on.
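(ns inc-sandbox.corr-demo
  ;; the embed mangled both :require targets into "1"; clojure.set is inferred
  ;; from its use in same-dates? below, the second target is not recoverable
  (:require [clojure.set])
  (:use (incanter core stats charts io))
  (:use (clj-time [format :only (formatter formatters parse)]
                  [coerce :only (to-long)])))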
(defn sym-to-dataset
  "returns a dataset read from a local CSV in './data/' given a Yahoo Finance symbol name"
  [yf-symbol]
  (let [+data "./data/"
        +csv ".csv"
        symbol (.toUpperCase yf-symbol)
        filename (str +data symbol +csv)]
    (-> (read-dataset
          filename
          :header true)
        (col-names
          [:Date :Open :High :Low :Close :Volume :Adj-Close]))))

(def gdx (sym-to-dataset "GDX"))
(def gld (sym-to-dataset "GLD"))

(defn same-dates?
  "are two datasets covering the same time frame?"
  [x y]
  (let [x-dates (into #{} ($ :Date x))
        y-dates (into #{} ($ :Date y))
        x-y (clojure.set/difference x-dates y-dates)
        y-x (clojure.set/difference y-dates x-dates)]
    (and (empty? x-y) (empty? y-x))))

(same-dates? gdx gld) ; true

(def gdx-ac ($ :Adj-Close gdx))
(def gld-ac ($ :Adj-Close gld))

(defn dates-long
  "returns the dates as long"
  [data]
  (let [ymd-formatter (formatters :year-month-day)
        dates-str ($ :Date data)]
    (map #(to-long (parse ymd-formatter %)) dates-str)))

;; no replace col func
(def gdx-times (dates-long gdx))
(def gld-times (dates-long gld))

(view (time-series-plot gld-times gld-ac
                        :x-label "Date"
                        :y-label "GLD"))
(view (time-series-plot gdx-times gdx-ac
                        :x-label "Date"
                        :y-label "GDX"))
Calculating Pearson’s and Spearman’s correlation coefficients is straightforward enough. Each is just a function call away. I am surprised to see Spearman’s rho implemented in Incanter already, as its non-parametric statistics library is practically non-existent. Yet another project to work on.
However, calculating the coefficients is only half the story. Where are the p-values? They don’t seem to be available from these functions. The t-test function is a good example of the kind of results I would like to see. A third item to work on.
(correlation gdx-ac gld-ac) ; 0.7906494552829249
(spearmans-rho gdx-ac gld-ac) ; 0.7728859703262337
;; no p-value in col, like t-test

(let [lm (linear-model gdx-ac gld-ac)]
  (doto (scatter-plot gld-ac gdx-ac
                      :x-label "GLD"
                      :y-label "GDX")
    (add-lines gld-ac (:fitted lm))
    view))

GLD-GDX scatter plot
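On the missing p-values: until the correlation functions report one, a significance test for Pearson’s r can be worked out by hand. This is a minimal sketch, assuming the standard t-transformation t = r·sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The function name `corr-t-stat` is my own, and the sample size of 250 is a made-up placeholder, since the row count of the datasets is not stated here.

```clojure
(defn corr-t-stat
  "t statistic for testing H0: rho = 0, given Pearson's r and sample size n.
  (Hypothetical helper, not an Incanter function.)"
  [r n]
  (* r (Math/sqrt (/ (- n 2) (- 1.0 (* r r))))))

;; r is the Pearson coefficient from the correlation call above;
;; n = 250 is a placeholder, substitute the real number of rows.
(def t-stat (corr-t-stat 0.7906494552829249 250))

;; A two-sided p-value would then come from the t distribution with
;; (- n 2) degrees of freedom. With incanter.stats loaded, and assuming its
;; cdf-t function takes a :df option, something along the lines of
;;   (* 2 (- 1 (cdf-t (Math/abs t-stat) :df 248)))
;; should do it; that line is untested here, so treat it as a sketch.
```

The larger the absolute t statistic for a given n, the smaller the p-value, so even without the final CDF lookup this gives a rough feel for how strong the evidence is.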
Even though Incanter is still early in its development, it is certainly a usable statistical platform offering many of the basics. I look forward to learning more about it and contributing to the project. Now, which of the listed problems should I tackle first?
P.S. My code embed seems to be mysteriously breaking my code. As an alternative, the complete source is available on Gist.