Building a distributed back-tester with Hadoop on Amazon AWS

Testing is arguably the single most important aspect of trading system development. You can’t tell how well an idea works unless you test it out. Testing can also help you identify weaknesses or strengths in your model. The downside to testing is that it takes time to churn through those gigabytes of data.

Backtesting is inherently a linear process. You feed your tick data into your algorithm and expect some actions. You can’t really make use of fork/join to let other threads steal from the process queue, as each later step depends on the results of earlier calculations. More often than not, however, you’re interested in testing many variations of a strategy. This is where MapReduce comes into play.

MapReduce is a Google software framework. It is inspired by the map and reduce functions ubiquitous in functional programming. They are as common as for-loops in the Java world.

The map function partitions an input into smaller problems and runs them concurrently, e.g. each variant of the strategy is executed on its own node.

The reduce function takes the results from all the nodes and aggregates them into an output, e.g. a comparison of the back-test results from each strategy variant.
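On a single machine, the same shape can be sketched with Clojure's built-in map and reduce. This is only an illustration under stated assumptions: the price series, the trading rule, and the names `back-test` and `best-variant` are all hypothetical, and nothing here touches Hadoop.

```clojure
;; Hypothetical toy data standing in for tick data.
(def prices [100.0 101.5 99.8 102.3 103.1 101.9 104.4 105.0])

(defn back-test
  "Runs ONE strategy variant sequentially over the prices.
  Toy rule (an assumption for illustration): be long over the next step
  whenever the previous step rose by more than `threshold`."
  [threshold prices]
  (let [changes (map - (rest prices) prices)
        ;; pair each price change with the change that follows it
        pnl     (reduce + (map (fn [prev nxt] (if (> prev threshold) nxt 0.0))
                               changes (rest changes)))]
    {:threshold threshold :pnl pnl}))

(defn best-variant
  "Map step: back-test every variant independently.
  Reduce step: keep the variant with the highest P&L."
  [thresholds prices]
  (->> thresholds
       (map #(back-test % prices))
       (reduce (fn [best r] (if (> (:pnl r) (:pnl best)) r best)))))

(best-variant [0.0 1.0 2.0] prices)
```

Each call to `back-test` is still linear, but the variants do not depend on one another, which is why they can be farmed out. Swapping `map` for `pmap` would spread them across local cores; a cluster plays the same role at a larger scale.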

Having used functional programming for some time now, using map/reduce is very natural for me. Where my knowledge falls short is in implementing a distributed infrastructure for running these map and reduce functions at a scale well beyond my own multi-core computer.

It just so happens that Amazon AWS has a hosted Hadoop PaaS, and Hadoop is Apache’s framework for MapReduce. Hardware, check. Framework, check. This will be the second system that I’ll be working on in my goal to build a complete trading R&D platform.

Expect some technical discussions in the coming months as I work my way through. Now, where should I start…


Towards a broker-agnostic trading system

The first generation of my trading system uses a broker’s proprietary script on that same broker’s platform. The second generation of my trading system uses a generic programming language on another broker’s platform. I am now working on the third generation of my trading system. It is language-agnostic, platform-agnostic, and broker-agnostic.

Why? Because designing trading systems is as much about creativity as it is about science. An artist would not limit themselves to a defined toolset. I want to be free to use whatever tools are necessary to get the job done efficiently, and to be able to try new things as technologies evolve.

Having a broker-agnostic system frees me from being locked into any particular broker. It also means I could use one system to gather data from, and trade through, multiple brokers for the best execution prices.
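One way to picture broker-agnosticism is a small protocol that every broker adapter implements, so the rest of the system never names a specific broker. The sketch below is purely illustrative: the `Broker` protocol, `StaticBroker`, `best-ask`, and the quotes are hypothetical and do not correspond to any real broker API.

```clojure
;; Hypothetical abstraction; all names and prices are made up for illustration.
(defprotocol Broker
  (broker-name [this] "Identifier of the broker.")
  (ask-price [this sym] "Current ask price for a symbol, or nil if not quoted."))

;; A stand-in broker backed by a fixed map of quotes.
(defrecord StaticBroker [label quotes]
  Broker
  (broker-name [_] label)
  (ask-price [_ sym] (get quotes sym)))

(defn best-ask
  "Returns {:broker name :price p} for the lowest ask across brokers,
  or nil when no broker quotes the symbol."
  [brokers sym]
  (->> brokers
       (keep (fn [b]
               (when-let [p (ask-price b sym)]
                 {:broker (broker-name b) :price p})))
       (sort-by :price)
       first))

(def brokers
  [(->StaticBroker "broker-a" {"GLD" 120.15 "GDX" 55.02})
   (->StaticBroker "broker-b" {"GLD" 120.11})])

(best-ask brokers "GLD")
```

A real adapter would wrap a broker's own API behind the same two functions, so strategy code such as `best-ask` would not need to change when a broker is added or dropped.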

Yes, this is over-engineering for a small entity like myself. But heck, this is something that I enjoy building. So why not?


Vector algorithm using tree composition

Sniffed this trick from the Incanter source. Here’s a demo of using tree composition to calculate a 2-d Euclidean distance between two points.

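;; sqrt and pow come from Incanter
(use '(incanter core))

(def x [1 2])
(def y [4 5])

(defn- tree-comp-each
  "Applies branch to each leaf, then combines the results with root."
  [root branch & leaves]
  (apply root (map branch leaves)))

(defn euclidean-distance
  [a b]
  {:pre [(= (count a) (count b))]}
  (sqrt
    (apply
      tree-comp-each
      +                      ; root: sum the squared differences
      (fn [[x y]]            ; branch: squared difference of one coordinate pair
        (pow (- x y) 2))
      (map vector a b))))    ; leaves: the paired coordinates

(euclidean-distance x y)     ; square root of 18, roughly 4.24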

I’ve only been exposed to tree traversal when implementing searching and sorting algorithms. This trick definitely opened my eyes to the wonders of functional programming. It’s about more than just being able to pass and manipulate functions.
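To see why the root/branch/leaves split is useful, here is a small sketch that reuses the same helper for two other distance measures by swapping only the root and branch functions. It redefines `tree-comp-each` so that it stands alone and uses plain Java math instead of Incanter; `manhattan-distance` and `chebyshev-distance` are my own illustrative names, not functions from the Incanter source.

```clojure
(defn tree-comp-each
  "Applies branch to each leaf, then combines the results with root."
  [root branch & leaves]
  (apply root (map branch leaves)))

(defn manhattan-distance
  "Sum of absolute coordinate differences: root is +, branch is |x - y|."
  [a b]
  {:pre [(= (count a) (count b))]}
  (apply tree-comp-each
         +
         (fn [[x y]] (Math/abs (- x y)))
         (map vector a b)))

(defn chebyshev-distance
  "Largest absolute coordinate difference: same branch, but root is max.
  Assumes non-empty vectors, since max needs at least one argument."
  [a b]
  {:pre [(= (count a) (count b)) (pos? (count a))]}
  (apply tree-comp-each
         max
         (fn [[x y]] (Math/abs (- x y)))
         (map vector a b)))

(manhattan-distance [1 2] [4 5])
(chebyshev-distance [1 2] [4 7])
```

The leaves stay the same in every case (the paired coordinates); only the functions at the root and on the branches change.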

I need to think like a tree.


First impression of Incanter: usably incomplete

Speaking out loud about my new favourite toys, Clojure (a functional programming language) and Incanter (an R-like statistical platform built on Clojure). My first impression is that Incanter is usable, but as a new platform it is far from being as polished as R. I am using the Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance as examples to perform a simple correlation test. Some Incanter functions are rough around the edges and leave a lot to be desired. This is in stark contrast to R, whose plug-in modules are both impressive and comprehensive. On the other hand, making things happen with Clojure is … fun. And that’s a winner for me.

Here’s a walkthrough of my first attempt at using Incanter to analyse stock data.

Figures 1 and 2 below are basic time series plots of GDX and GLD. I’ll need to study the Incanter source code to figure out how to plot multiple time series on the same chart. I couldn’t find it at first glance.

Figure 1: GDX

Figure 2: GLD

The following looks like a lot of code just to plot two graphs. Most of the boilerplate code is to coerce the Yahoo Finance CSV data to specifically fit Incanter’s time-series-plot function. I will wrap this into a library later on.

(ns inc-sandbox.corr-demo
  (:require [clojure.set])   ; for clojure.set/difference in same-dates?
  (:use (incanter core stats charts io))
  (:use (clj-time [format :only (formatter formatters parse)]
                  [coerce :only (to-long)])))

(defn sym-to-dataset
  "returns a dataset read from a local CSV in './data/' given a Yahoo Finance symbol name"
  [yf-symbol]
  (let [+data    "./data/"
        +csv     ".csv"
        symbol   (.toUpperCase yf-symbol)
        filename (str +data symbol +csv)]
    (-> (read-dataset
          filename
          :header true)
      (col-names
        [:Date :Open :High :Low :Close :Volume :Adj-Close]))))

(def gdx (sym-to-dataset "GDX"))
(def gld (sym-to-dataset "GLD"))

(defn same-dates?
  "are two datasets covering the same time frame?"
  [x y]
  (let [x-dates (into #{} ($ :Date x))
        y-dates (into #{} ($ :Date y))
        x-y     (clojure.set/difference x-dates y-dates)
        y-x     (clojure.set/difference y-dates x-dates)]
    (and (empty? x-y) (empty? y-x))))

(same-dates? gdx gld)		; true

(def gdx-ac ($ :Adj-Close gdx))
(def gld-ac ($ :Adj-Close gld))

(defn dates-long
  "returns the dates as long"
  [data]
  (let [ymd-formatter (formatters :year-month-day)
        dates-str     ($ :Date data)]
    (map #(to-long (parse ymd-formatter %)) dates-str)))

;; I found no function to replace a dataset column, so the converted dates are kept in separate vars

(def gdx-times (dates-long gdx))
(def gld-times (dates-long gld))

(view (time-series-plot gld-times gld-ac
        :x-label "Date"
        :y-label "GLD"))
(view (time-series-plot gdx-times gdx-ac
        :x-label "Date"
        :y-label "GDX"))

Calculating the Pearson and Spearman correlation coefficients is straightforward enough. Each is just a function call away. I am surprised to see Spearman’s rho implemented in Incanter already, as its non-parametric statistics library is practically non-existent. Yet another project to work on.

However, calculating the coefficients is only half the story. Where are the p-values? They don’t seem to be available from these functions. The t-test function is a good example of the kind of results I would like to see. A third item to work on.

(correlation gdx-ac gld-ac)		; 0.7906494552829249
(spearmans-rho gdx-ac gld-ac)	; 0.7728859703262337
;; neither function returns a p-value the way t-test does
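As a rough sketch of the missing piece: under the usual null hypothesis of zero correlation (and assuming roughly bivariate-normal data), a Pearson coefficient r from n pairs gives the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The plain-Clojure functions below are my own illustration, not part of Incanter. They stop at the t statistic; turning it into a p-value still needs a Student's t CDF, which is not included here.

```clojure
(defn pearson-r
  "Pearson correlation coefficient of two equal-length numeric sequences."
  [xs ys]
  {:pre [(= (count xs) (count ys)) (> (count xs) 1)]}
  (let [n    (count xs)
        mean (fn [zs] (/ (reduce + zs) (double n)))
        mx   (mean xs)
        my   (mean ys)
        dx   (map #(- % mx) xs)          ; deviations from the mean
        dy   (map #(- % my) ys)
        sxy  (reduce + (map * dx dy))
        sxx  (reduce + (map * dx dx))
        syy  (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (* sxx syy)))))

(defn correlation-t-stat
  "t statistic for testing r against zero, with n - 2 degrees of freedom."
  [r n]
  {:pre [(< -1 r 1) (> n 2)]}
  (* r (Math/sqrt (/ (- n 2) (- 1.0 (* r r))))))

;; Made-up numbers for illustration only.
(let [xs [1 2 3 4 5]
      ys [2 1 4 3 5]
      r  (pearson-r xs ys)]
  {:r r :t (correlation-t-stat r (count xs))})
```

With a sample of any reasonable size, a coefficient near 0.79 like the one above gives a large t, but the exact p-value depends on the number of observations.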

(let [lm (linear-model gdx-ac gld-ac)]
  (doto (scatter-plot gld-ac gdx-ac
                      :x-label "GLD"
                      :y-label "GDX")
    (add-lines gld-ac (:fitted lm))
    view))

GLD-GDX scatter plot

Even though Incanter is still early in its development, it is certainly a usable statistical platform offering many of the basics. I look forward to learning more about it and contributing to the project. Now, which of the listed problems should I tackle first?

P.S. My code embed seems to be mysteriously breaking my code. As an alternative, the complete source is available on Gist.


It’s an open buffet in a small business

One of the few benefits of working for myself is that I don’t need to worry about compatibility with legacy systems. I am free to use whatever open source tools get the job done well. The downside is that there are so many technologies out there that it’s hard to choose the right ones for the job. To give you a sense of what I mean, here are some of the topics that I have either tried or seriously considered in the past year.

Programming language choices are important because they sit at the bottom of a technology stack. Most of the work that I do is built on them. For a long while, I settled on a combination of Java, Python, and R: I prototype small ideas in Python, implement production code in Java, and perform analysis in R. I discussed why to use the right tool for the right task a year ago.

By the end of my previous project, I was finding that the popular development triplet of Java, Python, and R is not ideal for a solo operation. Since I have more time on my hands now that QTD is trading for me, I am taking a break this summer to expand my knowledge and learn new technologies.

Some of the technologies that I am experimenting with include:

  • an in-memory data store for instantaneous and persistent tick data
  • parallel programming for concurrent processing with no locks
  • mathematically intuitive algorithm implementations using higher-order functions
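To give a flavour of the last two items, here is a minimal Clojure sketch, assuming a Clojure atom as a stand-in for the in-memory store. Every name in it (`record-tick!`, `record-concurrently!`, `summarise`) is hypothetical, and it does not cover durability: the atom lives only in memory.

```clojure
(defn record-tick!
  "Appends a price to the vector stored under sym.
  swap! retries a compare-and-set, so callers never take an explicit lock."
  [store sym price]
  (swap! store update-in [sym] (fnil conj []) price))

(defn record-concurrently!
  "Writes every price from its own thread, then waits for all of them."
  [store sym prices]
  (let [threads (mapv (fn [p]
                        (Thread. ^Runnable (fn [] (record-tick! store sym p))))
                      prices)]
    (doseq [^Thread t threads] (.start t))
    (doseq [^Thread t threads] (.join t))))

(defn summarise
  "Higher-order helper: applies any function f to the ticks stored for sym."
  [f store sym]
  (f (get @store sym)))

;; Made-up prices for illustration.
(def store (atom {}))
(record-concurrently! store "GLD" [120.1 120.3 119.9])
(summarise #(apply max %) store "GLD")
```

The arrival order of ticks written from different threads is not deterministic, so a real tick store would also need timestamps or sequence numbers.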

Don’t mind me as I help myself in this open buffet of technologies.
