My 5-minute lightning talk on Cascalog

Cascalog is a data processing library for building MapReduce jobs. It makes it a lot simpler to build distributed strategy backtesters on terabytes of market data, for example. I’ve been spiking out a data processing project with it at work for the past couple of weeks, so I thought I might as well give a lightning talk about it at our monthly developers’ meetup. Here are my presentation slides introducing Cascalog and outlining its features.

The possibilities…


Local Hadoop test cluster up and running

Thanks to Cloudera’s CDH3 image, I have a virtual machine with Hadoop on CentOS 5 working. I’m more of an Ubuntu guy, so CentOS is new to me. But it was nothing Google couldn’t solve.

I also ran into a Hadoop exception about the Java heap space. I couldn’t find a solution online, so I just bumped up the memory on the virtual machine and that solved the problem.

In any case, I managed to run the pi calculation example on my local Hadoop cluster.


First impression of Incanter: usably incomplete

Speaking out loud about my new favourite toys: Clojure (a functional programming language) and Incanter (an R-like statistical platform built on Clojure). My first impression is that Incanter is usable, but because it is a new platform it is far from being as polished as R. I am using the Gold Miners ETF (GDX) and Gold Trust (GLD) data from Yahoo Finance as examples to perform a simple correlation test. Some Incanter functions are rough around the edges and leave a lot to be desired. This is in stark contrast to R, whose plug-in modules are both impressive and comprehensive. On the other hand, making things happen with Clojure is … fun. And that’s a winner for me.

Here’s a walkthrough of my first attempt at using Incanter to analyse stock data.

Figures 1 and 2 below are basic time series plots of GDX and GLD. I’ll need to study the Incanter source code to figure out how to plot multiple time series on the same chart. I couldn’t find it at first glance, though.

Figure 1: GDX

Figure 2: GLD

The following looks like a lot of code just to plot two graphs. Most of the boilerplate code is to coerce the Yahoo Finance CSV data to specifically fit Incanter’s time-series-plot function. I will wrap this into a library later on.

(ns inc-sandbox.corr-demo
  (:require [clojure.set]) ; needed for clojure.set/difference in same-dates?
  (:use (incanter core stats charts io))
  (:use (clj-time [format :only (formatter formatters parse)]
                  [coerce :only (to-long)])))

(defn sym-to-dataset
  "returns a dataset read from a local CSV in './data/' given a Yahoo Finance symbol name"
  [yf-symbol]
  (let [+data    "./data/"
        +csv     ".csv"
        symbol   (.toUpperCase yf-symbol)
        filename (str +data symbol +csv)]
    (-> (read-dataset
          filename
          :header true)
      (col-names
        [:Date :Open :High :Low :Close :Volume :Adj-Close]))))

(def gdx (sym-to-dataset "GDX"))
(def gld (sym-to-dataset "GLD"))

(defn same-dates?
  "are two datasets covering the same time frame?"
  [x y]
  (let [x-dates (into #{} ($ :Date x))
        y-dates (into #{} ($ :Date y))
        x-y     (clojure.set/difference x-dates y-dates)
        y-x     (clojure.set/difference y-dates x-dates)]
    (and (empty? x-y) (empty? y-x))))

(same-dates? gdx gld) ; true

(def gdx-ac ($ :Adj-Close gdx))
(def gld-ac ($ :Adj-Close gld))

(defn dates-long
  "returns the dates as long"
  [data]
  (let [ymd-formatter (formatters :year-month-day)
        dates-str     ($ :Date data)]
    (map #(to-long (parse ymd-formatter %)) dates-str)))

;; Incanter has no function to replace a column, so keep the converted dates separately

(def gdx-times (dates-long gdx))
(def gld-times (dates-long gld))

(view (time-series-plot gld-times gld-ac
        :x-label "Date"
        :y-label "GLD"))
(view (time-series-plot gdx-times gdx-ac
        :x-label "Date"
        :y-label "GDX"))

Calculating the Pearson and Spearman correlation coefficients is straightforward enough: each is just a function call away. I am surprised to see Spearman’s rho implemented in Incanter already, as its non-parametric statistics library is practically non-existent. Yet another project to work on.

However, calculating the coefficients is only half the story. Where are the p-values? They don’t seem to be available from these functions. The t-test function is a good example of the results I would like to see. A third item to work on.

(correlation gdx-ac gld-ac)   ; 0.7906494552829249
(spearmans-rho gdx-ac gld-ac) ; 0.7728859703262337
;; neither function reports a p-value, unlike t-test

(let [lm (linear-model gdx-ac gld-ac)]
  (doto (scatter-plot gld-ac gdx-ac
          :x-label "GLD"
          :y-label "GDX")
    (add-lines gld-ac (:fitted lm))
    view))

GLD-GDX scatter plot
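
Until p-values are available for these coefficients, one way to approximate one is a permutation test: shuffle one series many times and count how often the shuffled correlation is at least as strong as the observed one. The following is only a sketch in plain Clojure and does not use Incanter. The helper names avg, pearson-r and perm-p-value are hypothetical and are not part of Incanter's API.

```clojure
(defn avg
  "Arithmetic mean of a sequence of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn pearson-r
  "Pearson correlation coefficient of two equal-length numeric sequences."
  [xs ys]
  (let [mx  (avg xs)
        my  (avg ys)
        dx  (map #(- % mx) xs)
        dy  (map #(- % my) ys)
        sxy (reduce + (map * dx dy))
        sxx (reduce + (map * dx dx))
        syy (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (* sxx syy)))))

(defn perm-p-value
  "Two-sided permutation p-value for the correlation of xs and ys.
  Shuffles ys n-perms times and returns the share of shuffles whose |r|
  is at least the observed |r|, with the usual add-one correction."
  [xs ys n-perms]
  (let [observed (Math/abs (pearson-r xs ys))
        shuffled (repeatedly n-perms
                             #(Math/abs (pearson-r xs (shuffle ys))))
        extreme  (count (filter #(>= % observed) shuffled))]
    (/ (inc extreme) (double (inc n-perms)))))
```

With the series above this could be called as (perm-p-value gdx-ac gld-ac 999). The result varies slightly between runs because the shuffles are random, and the smallest value it can report is 1 / (n-perms + 1).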

Even though Incanter is still early in its development, it is certainly a usable statistical platform offering many of the basics. I look forward to learning more about it and contributing to the project. Now, which of the listed problems should I tackle first?

P.S. My code embed seems to be mysteriously breaking my code. As an alternative, the complete source is available on Gist.


It’s an open buffet in a small business

One of the few benefits of working for myself is that I don’t need to worry about compatibility with legacy systems. I am free to use whatever open source tools get the job done well. The downside is that there are so many technologies out there that it’s hard to choose the right ones for the job. To give you a sense of what I mean, here are some of the topics that I have either tried or seriously considered in the past year.

Programming language choices are important because they sit at the bottom of a technology stack. Most of the work that I do is built on them. For a long while, I settled on a combination of Java, Python, and R: I prototype small ideas in Python, implement production code in Java, and perform analysis in R. I discussed why to use the right tool for the right task a year ago.

By the end of my previous project, I found that the popular development triplet of Java, Python, and R is not ideal for a solo operation. Seeing that I have more time on my hands because I am using QTD to trade for me now, I am taking a break this summer to expand my knowledge and learn new technologies.

Some of the technologies that I am experimenting with include:

  • an in-memory data store for instantaneous and persistent tick data
  • parallel programming for concurrent processing with no locks
  • mathematically intuitive algorithm implementations using higher-order functions
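
To make the last two items a little more concrete, here is a minimal sketch. The list above does not name a language or library, so the use of Clojure here, along with the names sma, tick-count and count-ticks!, is an assumption made purely for illustration.

```clojure
;; Higher-order functions: a simple moving average reads much like its
;; mathematical definition, the mean of each sliding window of n prices.
(defn sma
  "Simple moving average of prices over a sliding window of size n."
  [n prices]
  (map #(/ (reduce + %) (double n))
       (partition n 1 prices)))

;; Lock-free concurrency: an atom is updated by compare-and-swap retries
;; rather than locks, so parallel workers can safely share one counter.
(def tick-count (atom 0))

(defn count-ticks!
  "Increments tick-count once per tick, spreading the work across threads,
  and returns the counter's value afterwards."
  [ticks]
  (dorun (pmap (fn [_] (swap! tick-count inc)) ticks))
  @tick-count)
```

For example, (sma 3 [1 2 3 4 5]) yields (2.0 3.0 4.0), and a first call to (count-ticks! (range 1000)) returns 1000 however the increments interleave.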

Don’t mind me as I help myself in this open buffet of technologies.


Encrypting your communications to minimize eavesdropping

The case of Galleon Group founder Raj Rajaratnam shows a corporate security crack that shouldn’t happen in this day and age. In the Rajaratnam case, government wiretaps served as vital evidence against him. The question, though, is this: if the government is able to do that to a multi-billion dollar company without it knowing, what about competitors or enemies with ill intent? With all the resources available to it, how can something so old-school happen to a hedge fund? Secrecy and conspiracy shouldn’t be unfamiliar in the hedge fund industry, so I had assumed that they operated behind an iron wall.

Neither government wiretapping nor corporate espionage is a worry of mine. However, I still try to employ prudent security procedures in the quantitative financial research and development that I do at Quantisan Systems, simply because I value my intellectual property.

For example, in my email correspondence with business associates, I make use of GPG with a 2048-bit RSA asymmetric key pair (my public key is below). GPG is an open-source equivalent of Pretty Good Privacy. It is most likely that your email client supports it too. I use Thunderbird and it is a built-in feature. You just need to enable it and set it up.

On voice communications, I prefer to use Skype rather than regular voice calls because it’s free and Skype inherently uses 256-bit AES encryption. So no setup is even required for this added security.

For my important files, I place them in a TrueCrypt volume encrypted with 256-bit AES encryption. The encrypted container is synchronized to my Dropbox account so that I can use it anywhere while ensuring security. TrueCrypt is available on Windows, Mac, and Linux. It’s only a matter of following the installation instructions to safeguard your files from prying eyes.

All of these encryption tools are free, and most of them are open-source. I configured them to work in the background and they don’t hinder my workflow at all. It is certainly not a sophisticated setup, yet it isn’t too shabby either. At the very least, if someone were to steal my notebook computer, an ordinary crook wouldn’t be able to gain access to my sensitive information.

Given that Quantisan Systems is just a one-man operation with practically no resources, I can’t believe security at some financial companies is worse than this homebrew setup.

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.11 (GNU/Linux)

mQENBE1Ul0gBCACZGqPoYrzRF9YcuUvMeES+etPUAeNQ6uvw42k7Sf1ydBp3lFGQ
ggYnOsmQ8Nf/62sB9CoYkAV+h+CPTuTOT5N4gwTojtPX3bIrjA6JgTDpt7bpwyMM
j+mnZboXdBgZwz1xqiB38Oj51OBYPPLbTc/YphYfjXEuO4tQwsQufd5dvJqF+Yse
CGeLRroJt9EmGVYW/NaCEBKHaMnD9gkDyWLxm7GFQYhl7ZWU+VQXjBxzZyYtpup2
s1N5/igjiRj8t5CGnMwrG2yZkPW+n8m5sTLjcHWFY9OxiqFRaYgm9Vk3mVhxE5ML
DPtSPySF+KkyIFxeL78MscU2W1DMH5pNM/+fABEBAAG0HVBhdWwgTGFtIDxwYXVs
QHF1YW50aXNhbi5jb20+iQE+BBMBAgAoBQJNVJdIAhsjBQkFo5qABgsJCAcDAgYV
CAIJCgsEFgIDAQIeAQIXgAAKCRD8QV1yPcrlhIphB/9cxkbWJ+19axfZxFYRE1u4
Gq1p5/hkm3990yT8kc+4Qg7xa2cm+SgBmEnAy24xdQpR/Up/EF13D15ZBiE1SG+D
0+EUZp4Ir5+TTuTrmJxKRb56IgFZVXP9KatNuK3aPvNFl4gZUZUZHmvUvvBzqtSM
HrfzjmuirXVKI6M5WPZchLNUGE1sDtR2SR7l+rjvcPTehaJf+9srK1gsl3Vb3Pzo
D5/JBoNK1ueDpfRxNhdryCUQr2uJ1pWLgauDz0iSRPOczV8x1n+051DJQXHx16tY
I0VOErTVxscE/KeuuiiFwOwSLteIMDIY2vg6KuUw3cQxxMcfeNTu+jTJXuroib+O
uQENBE1Ul0gBCAD05TZ2/oF1Lb8ggg9nKlgB+ZXWBVCc9CNlguZpZAmC7ezc72qT
gSho/5+3ldEqvA1AxnWubaWV26QLL2fNybKd1DG9uYOsnOwx6L5H8toQswcurC9y
3jKVQSpuIJrsrQiTT3aIZ3ZONGsg0k0D+ZiMC9qSLmFXh3f2fio5QsyRsEWlC9Qj
CDXrgpNZqpDegMvvRXXJfRGulh4sMTGXrxE2pXI/+9Hv64+BuBMkixA7UwiOBQbd
V5GkVm7V2hvgQpLc2QaXofLdBVIda6hvUuU/BK5m7HShNPKc21VgcSbYFOggYDal
efTeFdCfHEg08ND8JSa2Bv+fdEmOFjPF5Z9hABEBAAGJASUEGAECAA8FAk1Ul0gC
GwwFCQWjmoAACgkQ/EFdcj3K5YTRxAf8DspugdiTZUxIT/1EAa/0GWpsT8YZmITm
0iUebID3jZRLXpyqfDr9KUZvk6IMYru1h6c6pZdzY4IJ/p/sZ+p2gtLr1pmOqEjd
W94Q3adtjH/42zauFOdCzamtbc7Q5cJTIeAFcvFsVqDLWkK1vWiUhqUoacgN6TWS
MActe4B7hYdcMWsbj4j3xVoGkkiJOL81/MJW9f/xDqcNbLsiEK7DxOaWUb2CkuUl
7c/VVaSuWFBKGsj2fgEoTErfsmhjMm+WUcoE1Zjn2UwJzxxIcfzj3A0IEvMIua6u
UJWnKv8t55G9f+NQ3yrKtPC/+O7N1sCSilmB/IekaSrBX9KvZgHM8Q==
=DjK6
-----END PGP PUBLIC KEY BLOCK-----
